Yahoo! CaffeOnSpark: Distributed Deep Learning on Big Data Clusters
Get an overview of Yahoo!'s CaffeOnSpark, the latest entrant into the world of distributed deep learning, directly from the developers.
Deep learning (DL) is a critical capability required by Yahoo product teams (ex. Flickr, Image Search) to gain intelligence from massive amounts of online data. Many existing DL frameworks require a separated cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline (see Figure 1). The separated clusters require large datasets to be transferred among them, and introduce unwanted system complexity and latency for end-to-end learning.
As discussed in our earlier Tumblr post, we believe that deep learning should be conducted in the same cluster along with existing data processing pipelines to support feature engineering and traditional (non-deep) machine learning. We created CaffeOnSpark to allow deep learning training and testing to be embedded into Spark applications (see Figure 2).
CaffeOnSpark: API & Configuration and CLI
CaffeOnSpark is designed to be a Spark deep learning package. Spark MLlib supported a variety of non-deep learning algorithms for classification, regression, clustering, recommendation, and so on. Deep learning is a key capacity that Spark MLlib lacks currently, and CaffeOnSpark is designed to fill that gap. CaffeOnSpark API supports dataframes so that you can easily interface with a training dataset that was prepared using a Spark application, and extract the predictions from the model or features from intermediate layers for results and data analysis using MLLib or SQL.
Scala program in Figure 4 illustrates how CaffeOnSpark and MLlib work together:
- L1-L4 … You initialize a Spark context, and use it to create CaffeOnSpark and configuration object.
- L5-L6 … You use CaffeOnSpark to conduct DNN training with a training dataset on HDFS.
- L7-L8 …. The learned DL model is applied to extract features from a feature dataset on HDFS.
- L9-L12 … MLlib uses the extracted features to perform non-deep learning (more specifically logistic regression for classification).
- L13 … You could save the classification model onto HDFS.
As illustrated in Figure 4, CaffeOnSpark enables deep learning steps to be seamlessly embedded in Spark applications. It eliminates unwanted data movement in traditional solutions (as illustrated in Figure 1), and enables deep learning to be conducted on big-data clusters directly. Direct access to big-data and massive computation power are critical for DL to find meaningful insights in a timely manner.
CaffeOnSpark uses the configuration files for solvers and neural network as in standard Caffe uses. As illustrated in our example, the neural network will have a MemoryData layer with 2 extra parameters:
- source_class specifying a data source class
- source specifying dataset location
The initial CaffeOnSpark release has several built-in data source classes (including com.yahoo.ml.caffe.LMDB for LMDB databases and com.yahoo.ml.caffe.SeqImageDataSource for Hadoop sequence files). Users could easily introduce customized data source classes to interact with the existing data formats.
CaffeOnSpark applications will be launched by standard Spark commands, such as spark-submit. Here are 2 examples of spark-submit commands. The first command uses CaffeOnSpark to train a DNN model saved onto HDFS. The second command is a custom Spark application that embedded CaffeOnSpark along with MLlib.
Figure 5 describes the system architecture of CaffeOnSpark. We launch Caffe engines on GPU devices or CPU devices within the Spark executor, via invoking a JNI layer with fine-grain memory management. Unlike traditional Spark applications, CaffeOnSpark executors communicate to each other via MPI allreduce style interface via TCP/Ethernet or RDMA/Infiniband. This Spark+MPI architecture enables CaffeOnSpark to achieve similar performance as dedicated deep learning clusters.
Many deep learning jobs are long running, and it is important to handle potential system failures. CaffeOnSpark enables training state being snapshotted periodically, and thus we could resume from previous state after a failure of a CaffeOnSpark job.
In the last several quarters, Yahoo has applied CaffeOnSpark on several projects, and we have received much positive feedback from our internal users. Flickr teams, for example, made significant improvements on image recognition accuracy with CaffeOnSpark by training with millions of photos from the Yahoo Webscope Flickr Creative Commons 100M dataset on Hadoop clusters.
CaffeOnSpark is beneficial to deep learning community and the Spark community. In order to advance the fields of deep learning and artificial intelligence, Yahoo is happy to release CaffeOnSpark at github.com/yahoo/CaffeOnSpark under Apache 2.0 license.
CaffeOnSpark can be tested on an AWS EC2 cloud or on your own Spark clusters. Please find the detailed instructions at Yahoo github repository, and share your feedback at email@example.com. Our goal is to make CaffeOnSpark widely available to deep learning scientists and researchers, and we welcome contributions from the community to make that happen.
Original. Reposted with permission.