Spark for Scale: Machine Learning for Big Data

This post discusses the fundamental concepts for working with big data using distributed computing, and introduces the tools you need to build machine learning models.

Machine learning for big data, using the Titanic data set

Machine learning in a distributed environment can be tricky if you are coming from a data science background, where specialized tools are used for this purpose. It is important that, while writing our Machine Learning algorithms, we optimize them to take advantage of the performance benefits offered by a distributed system.

While trying to understand Machine Learning on Spark, we will use the ML Library that ships with Spark. Spark’s default package ships with two modules —  and .

For the purpose of this article, we will be using the Random Forest algorithm to train a machine learning model to make predictions.

By now you have learned how to load a data frame into Spark and run some basic analytics on it using SQL. However, machines can not understand the data like we do, so it is important that we convert the data to a format that Spark can understand to build a machine learning model. This will require some basic data cleaning and transformation. Once we have that, all we need to do is call some library functions to build a model, test the model, and then evaluate the success percentage.

Let’s first go through the code and understand the comments, and then explain the functionality in detail. The code for the above is as follows:

The first few steps are easy — we import the functions and classes we will need in our program, and load the CSV file as a DataFrame in Spark.

What you see next is interesting. In the second step, we define two Scala case classes called Initial and Final, both representing the schema for the initial and final state of the data frame in the cleaning process. While there are many ways to do this, it’s interesting how you can use Scala data types to apply methods over a data set in Spark.

Next, we define an <assembler> , which is a library module provided by ML Spark library for converting numerical values in a DataFrame to a machine-understandable vector. A vector is a linear representation of the data in different dimensions. We will later use this assembler to extract the numerical values out of our data and convert them to a vector. However, before we can do that, we will need to remove the null values and convert any string values in our data to numerical values, which is exactly what our <autobot> function does.

In the next two steps, we run the autobot function and apply our assembler on the output to produce clean data, from which we then drop extra columns and rename the “survived” column to “label” column since that is what will be the label for our machine learning model.

Now we define a label and feature indexer, set the input and output column name, and fit the entire data set in this index model. The reason why we fit the entire data set, and not the training set, is that we want our model to have all the labels. It is likely that, while splitting the dataset into training and testing parts, we might lose certain labels for features in one of them while they might exist in the other. Using the entire data set allows us to define metadata on all the features and labels so that we can use the index when building the Machine Learning model.

Once we have the indexer, we will create a Random Forest model and set the names for input and output columns on the table we will use for training the model and predicting the labels. We also define a label converter before this so that, when we’re using the index that we created in the last step, we can extract the labels out of that index and put them in the predictions.

That, so far, was the difficult part. In the next steps, we build a pipeline that tells Spark the steps it must follow to create a Random Forest model from the training data, and run the pipeline to extract a model out of it.

Evaluating and testing the model

Once we have trained a model, the next step would be to test that model and evaluate the accuracy. A model can be used to transform a test data frame to produce predictions, as done in the next steps in the code above. The problem that we face is to easily test the accuracy of our model.

Fortunately, Spark comes with a concept called evaluators, which lets us test our test data against our model and gives us the accuracy of that model. In the last two steps of our example, we create an evaluator and print the output of that evaluator on the screen to get the accuracy probability.

Now that we have a model and its accuracy, we can test as many inputs against this model as we want. What would be nice here is a way to dig deeper inside our model and understand how exactly the decisions are being made. Not all machine learning and classification algorithms produce a model that is human readable; but, in our case, we can dig deeper because we are using the Random Forest classification algorithm.

All we need to do in is type cast our model so Scala can identify the model and call the  method on our model to print the decision making of the model on the console.

Note that we used Random Forest only for demonstration purposes. The accuracy of the predictions will, of course, depend on the model used. Random Forest Classification algorithm is just one of the many algorithms available for training a model. Spark provides many more algorithms and techniques for machine learning, classification, and regression. If you feel confident using Spark, you can even implement your own algorithms using the low-level Spark API.

The future of Spark

Along with other developments in the distributed systems community, Spark is growing very quickly and trying to solve many problems along the way.

There are many things that Spark has to work on. As of this writing, Spark community is working on integrating the project Tungsten with Spark internals to optimize the execution plan for Spark jobs. Other than that, the community is working on making the project easier to use and more accessible in general.

It looks like support for more machine learning techniques and better support for neural networks is the natural progression for Apache Spark. But only time will tell!

Where to go from here

While this tutorial was supposed to give you a quick introduction to Spark and distributed systems, there are still many things that you can learn about this domain. A great place to start is Stanford’s Mining Massive Datasets Project, which comes with a free ebook and an online course to learn about machine learning on distributed systems.

If you want to get involved with Spark, it is recommended that you know Scala to be able to understand how some of the core Spark modules work. Fortunately, there is a great free online program on Coursera, backed by the creators of Scala at EPFL.

If you want to learn more about Spark with the sole purpose of using it in your projects, Spark’s official documentation is a great place to start. To start playing around with Spark, you can refer to UCI Data Repository, which is a great source for freely available data sets used for machine learning.

In the end, the best way to learn is by trying. So download the latest edition of Spark, try it out, and explore the examples. All the best!

SocialCops is a data intelligence company that empowers organizations to tackle the world's most critical problems through data.

Original. Reposted with permission.