Spark for Scale: Machine Learning for Big Data

This post discusses the fundamental concepts for working with big data using distributed computing, and introduces the tools you need to build machine learning models.

Setting up your development environment

As mentioned above, Spark is written in Scala and therefore runs on the Java Virtual Machine (JVM). The first step is to download and install the dependencies our environment requires.

Here is a list of things that you need to download and install:

- The Java Development Kit (JDK)
- Scala
- SBT (the Simple Build Tool)
- A pre-built Apache Spark distribution

You may have noticed that the Spark codebase contains references to many Hadoop classes. This is because Spark integrates with Hadoop in several ways; for example, Hadoop's Distributed File System (HDFS) can serve as an input source for Spark. This makes working with the two technologies together even more convenient.

If this is the first time you are using Scala, you might not be familiar with the Simple Build Tool (SBT). SBT is a build and dependency management tool for JVM-based languages. It lets you structure your projects, build them, fetch dependencies, and perform many other tasks you would expect from a good build tool. While Spark itself uses Maven for production builds, SBT is easier to use for standalone Spark programs.

Working with Apache Spark

There are many ways to use Apache Spark in your projects, but the easiest way is to launch the interactive Scala console that ships with Spark. To do that, navigate to the directory where you extracted Spark and run the shell from the bin directory.
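For example, assuming you extracted Spark to ~/spark (the path is an assumption; adjust it to your setup), launching the shell looks like this:

```shell
# Navigate to the extracted Spark directory (example path; use your own)
cd ~/spark

# Launch the interactive Scala shell with Spark preloaded
./bin/spark-shell
```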

This will launch an interactive Scala shell for Spark. It will also create an object named "sc", which stands for SparkContext. The SparkContext is the entry point you use to access the Spark cluster.

Alternatively, you can use a Python shell called PySpark. While Python is a great language for exploratory data analysis, we will be using Scala for the purpose of this tutorial.

Another great way is to write stand-alone Spark applications by including Spark as a dependency in your SBT file.
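As a minimal sketch, a build.sbt for such an application might look like the following; the project name and the version numbers are assumptions, so pin them to the Scala and Spark versions you actually installed:

```scala
// build.sbt -- a minimal sketch; version numbers are assumptions,
// pin them to the Scala and Spark versions you actually installed.
name := "spark-word-count"

scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0"
```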

Understanding RDDs

The basic data structure of Spark is the RDD, which stands for Resilient Distributed Dataset. As the name suggests, an RDD is resilient and distributed by nature. If a compute node crashes, Spark can recompute the lost data on another machine by replaying the same sequence of transformations that produced it.

You can think of an RDD as a list that is processed in parallel instead of sequentially. If you have ever programmed in a functional language, this will all seem very natural to you.
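To see the analogy, here is a sketch using plain Scala collections, whose functional API mirrors the one you will use on RDDs:

```scala
// A plain Scala list supports the same functional operations
// (map, reduce, and so on) that Spark exposes on RDDs.
val numbers = List(1, 2, 3, 4, 5)

val doubled = numbers.map(_ * 2)    // transform every element
val total   = doubled.reduce(_ + _) // combine elements pairwise

println(doubled) // List(2, 4, 6, 8, 10)
println(total)   // 30
```

With an RDD, the same calls run across the cluster instead of on a single in-memory list.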

Word count example (using Spark)

Now that we have the basics clear, let’s try to implement our distributed word count algorithm on Spark. Since both Spark and Scala have functional semantics, we can easily do this in just three lines of code.
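Here is a sketch of that word count, meant to be run in the Spark shell; the input path "input.txt" is a placeholder, and sc is the SparkContext the shell creates for you:

```scala
// Read the file, split each line into words, and count each word.
// "input.txt" is a placeholder path; sc is the shell's SparkContext.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split(" ")) // one element per word
  .map(word => (word, 1))           // pair each word with a count of 1
  .reduceByKey(_ + _)               // sum the counts per word

counts.collect().foreach { case (word, n) => println(s"$word: $n") }
```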

While the above code looks suspiciously concise, I assure you it does exactly what we expect it to do. Let's now go a step further and try to understand what the code does.

The first thing we did was read the text file we want to process by calling the "textFile" method on the SparkContext. Next, we split each line on spaces to extract words. This is done using the "flatMap" method, which maps over every line in the file to extract its words and then flattens the output into a single collection of words.

Once we had all the words, we mapped each one into a key-value tuple containing the word and the integer value 1. We then reduced by key (i.e. by word), summing the left and right operands into one value. This way, we summed all the 1s for each word, as described in the algorithm earlier in the article.

At the end, we retrieve the result by calling the "collect" method on the output and printing the count for each word, line by line.

Working with DataFrames

Apart from lower-level RDDs, Spark also offers DataFrames as a higher-level way of accessing structured data.

DataFrames, at the time of this writing, are an abstraction over RDDs designed for tabular data. The DataFrame API lets you execute SQL against these tables and also provides higher-level helper functions for working with the data.
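As a sketch, assuming a CSV file named titanic.csv (the filename is an assumption), you could load it into a DataFrame from the Spark shell like this, where spark is the SparkSession the shell provides:

```scala
// Read a CSV file into a DataFrame; "titanic.csv" is a placeholder path.
// spark is the SparkSession available in the Spark shell.
val df = spark.read
  .option("header", "true")      // treat the first line as column names
  .option("inferSchema", "true") // infer column types from the data
  .csv("titanic.csv")

df.printSchema() // inspect the inferred columns and types
```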

Since DataFrames store tabular data, it makes sense to use tabular processing and querying techniques to work with them. One of the great advantages of Spark DataFrames is that you can use SQL syntax to work with the data in the frames. Let's register a table on our frame so we can use SQL to query the data.
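A minimal sketch of the registration step, assuming a DataFrame named df:

```scala
// Register the DataFrame as a temporary view so SQL queries can
// reference it by name. "titanic" is just the name we give the view.
df.createOrReplaceTempView("titanic")
```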

Once a frame is registered as a table, you can use SQL to run your queries. Let’s use SQL to draw some basic insights from the data, trying to stick to things that we did in our previous Titanic blog.
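For example, assuming the view registered above and a Survived column like the one in the standard Titanic dataset (the column name is an assumption), a simple aggregate query looks like this:

```scala
// Count passengers by survival status; the column name "Survived"
// is an assumption based on the standard Titanic dataset.
val survivors = spark.sql(
  "SELECT Survived, COUNT(*) AS total FROM titanic GROUP BY Survived"
)

survivors.show() // print the first rows of the result
```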

You might have noticed that we call the "show" method on each query result. "Show" is a quick and handy way to preview the top rows of your data without printing all of it.

To get a more comprehensive understanding of the DataFrames API, you should refer to the official Spark documentation.