Spark for Scale: Machine Learning for Big Data

This post discusses the fundamental concepts for working with big data using distributed computing, and introduces the tools you need to build machine learning models.

By SocialCops.

Machine learning header

Recently we shared an introduction to machine learning. While making machines learn from data is fun, the data from real-world scenarios often gets out of hand if you try to implement traditional machine-learning techniques on your computer. To actually use machine learning with big data, it’s crucial to learn how to deal with data that is too big to store or compute on a single computing machine.

Today we will discuss fundamental concepts for working with big data using distributed computing, then introduce the tools you need to build machine learning models. We’ll start with some naive methods of solving problems, which are meant only as an example. As we move forward, we will make things more realistic.

Understanding the idea behind MapReduce

MapReduce is a technique that is used to distribute a data set in parts to different agents. An agent here means a single computer. The idea is to do operations on parts of your data, then combine that data after the operations are done to find the insights you need.

Imagine that you run the security department of a large shopping mall and you need to know how many people entered the mall for a particular day. You ask all the guards at different gates to take a count of every person coming in the mall. When the gates close at the end of the day, you, as the supervisor, go to each guard to get their count. Now you have the total number of people who came to the mall that day.

MapReduce is very similar — the idea is to break down the computing problem and give it to multiple workers (i.e. computers). When each computer is done computing its part, all the parts are reduced into a single result by a master (i.e. another computer managing the workers).

Example: finding the average age for a table of names

Let’s now look at an example from a data engineer’s perspective. Imagine we have a table where every row contains a person’s details — phone number, email address, country, birth date, etc. Now we want to find the average age for each name in our table.

We used this technique to analyze voter roll data for our collaboration with The Hindu. Check out the first part and second part of the story.

We can map on our table to extract a person’s name and age, and reduce the output to reflect only the average age for a particular name.

Here are the steps that our algorithm has to follow to get the results we want:

  1. Take the name and the age of every person out of the table.
  2. Count the number of times each name comes up, and sum the corresponding ages for each name.
  3. Divide the summed ages for each name by the number of people with that name to get the average age.

Assuming that we want this algorithm to run on 5 computers, we will split our table into five parts and give each part to one computer. Each computer will create a new table with three columns — name, sum of age, and count. The computer will check each row in the original table. If it hasn’t seen the name in that row, the computer will create an entry for that name in the new table. If the computer has seen that name before, it will add that age to the summed age and increment the count in the new table. Once each computer has finished computing its parts, the computers will start combining their tables, adding the sum and the count. Then they can divide the count by the sum to get the average age for every name.

Note: This method would be a naive implementation of the MapReduce approach. It’s only meant as an introductory example.

Example: the obligatory word count example

In our second example, let’s count the number of times a word appears in a file. The problem is that our file is so huge that it can neither be stored in a single computer’s memory. However, while the file may be huge, the words it contains may be highly redundant. Thus, it’s possible that the table that we make of words and their counts will fit on a single machine.

Let’s assume that we are splitting our huge file (line by line) among 20 computers (i.e. workers), which are then supposed to count the number of times each word comes up in the file.

Each of our 20 workers will now split the file line by line, and further split the lines by spaces to get a list of all the words in the file. Now we have one big table split across multiple workers with only one column — a word.

Now each worker will create a new table with two columns — the word and its count — and go through our first table one row at a time. The worker will increment the count of the word in the new table by one if it already exists in the table, or put in a new word with a count of one if it does not.

Once all the workers are finished doing their part, a master will come into the picture. It will start reducing the resulting tables by all the workers into a single table containing each word and its count.

If all of this feels daunting, don’t panic. Things will get clearer once we start making these things work in practical programming environments.

Note: If you think that the resulting table at the end of all the MapReduce process will still be too big to store on a computer, don’t worry. We’ve mentioned a solution in the next section to solve that problem.

Introduction to Hadoop and Spark

Companies like Google that have been working with big data for a long time were the first companies to start thinking about distributing computing problems across multiple hardware resources. In 2003, Google published a research paper describing Google File System, which is a system for storing files on multiple storage devices across a data center by splitting them down into multiple parts which are stored on different storage devices. These part files were also replicated so that if one storage device crashed, you could recover the lost data by creating another copy from the replica.

The concept and the algorithms were quickly picked up by the community, and soon Hadoop File System (HDFS) was produced. HDFS was an open-source design of Google’s Distributed File System that enabled everyone in the world to host their files across different storage devices spread across what is usually a data center.

However, the mere storage of the data across various nodes was not enough for the software industry. This was the time when we were already hitting data processing limits and the open source world needed something to perform complex computations on the data across multiple storage devices. Thus, with HDFS the concept of MapReduce was introduced.

MapReduce was a concept implemented in Java, and a Java API was made available. Developers were now able to write the map and reduce function, select the source of data from their HDFS installation, and run their MapReduce operations to get the insights they needed from the data.

HDFS and MapReduce dominated the big data market for a long time until Matei Zahira, a Ph.D. student at the University of California at Berkeley, published his research project called Spark.

Backed by the Apache Software Foundation, Spark soon became one of the top Apache projects and gained massive traction in the market. Spark claimed it led to a 10 to 100 times speed increase over Hadoop’s MapReduce ecosystem. This was achieved by making many optimizations on the MapReduce design. Written in Scala — a modern hybrid (functional and object oriented) programming language — Spark would try to minimize the disk operations and try to keep the data in memory as much as possible.

While Hadoop provided an API for Java, Spark came with an API for Scala (which was way more user friendly), a Java API that was still superior to that of Hadoop, and an almost-as-good Python API. Python was already the programming language loved by data scientists. On the other hand, Scala, because of its functional properties, provided developers high flexibility and ease for quickly working out their solutions. Further, the speed of Spark’s programs allowed for interactive data analysis since people did not have to wait hours for their programs to finish executing.

In this article, we will be exploring Apache Spark and learn how to get basic utility out of it.