What is a Support Vector Machine, and Why Would I Use it?

Support Vector Machine (SVM) has become an extremely popular algorithm. In this post I try to give a simple explanation of how it works and give a few examples using the Python scikit-learn library.

This post originally appeared on the Yhat blog. Yhat is a Brooklyn-based company whose goal is to make data science applicable for developers, data scientists, and businesses alike. Yhat provides a software platform for deploying and managing predictive algorithms as REST APIs, while eliminating the painful engineering obstacles associated with production environments, like testing, versioning, scaling and security.

What is SVM?

SVM is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform your data, and then, based on these transformations, it finds an optimal boundary between the possible outputs. Simply put, it does some extremely complex data transformations, then figures out how to separate your data based on the labels or outputs you've defined.

So what makes it so great?

Well, SVM is capable of doing both classification and regression. In this post I'll focus on using SVM for classification. In particular I'll be focusing on non-linear SVM, or SVM using a non-linear kernel. Non-linear SVM means that the boundary that the algorithm calculates doesn't have to be a straight line. The benefit is that you can capture much more complex relationships between your datapoints without having to perform difficult transformations on your own. The downside is that training takes much longer, as it's much more computationally intensive.
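To make the straight-line versus curved-boundary point concrete, here is a minimal sketch. It assumes scikit-learn is installed and uses `make_circles` to generate toy data (a stand-in, not the data used elsewhere in this post). The only thing that changes between the two models is the `kernel` argument.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data (an assumption for illustration): an inner blob surrounded by a
# ring, so no single straight line can separate the two classes.
X, y = make_circles(n_samples=600, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Same algorithm, two kernels: a straight-line boundary vs. a curved one.
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On data shaped like this, the linear kernel should hover near chance while the RBF kernel separates the classes almost perfectly. The exact numbers depend on the random seed.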

Cows and Wolves

So what is the kernel trick?

The kernel trick takes the data you give it and transforms it. In go some great features which you think are going to make a great classifier, and out comes some data that you don't recognize anymore. It is sort of like unraveling a strand of DNA. You start with this harmless looking vector of data, and after putting it through the kernel trick, it's unraveled and compounded itself until it's now a much larger set of data that can't be understood by looking at a spreadsheet. But here lies the magic: in expanding the dataset, there are now more obvious boundaries between your classes, and the SVM algorithm is able to compute a much more optimal hyperplane.
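A tiny worked example may help here. The sketch below uses a degree-2 polynomial kernel, chosen only because its expanded form is small enough to write out by hand (it is not necessarily the kernel behind the plots in this post). It shows that the kernel computed on the original 2-D points gives the same number as a dot product in the expanded 3-D space, which is why the algorithm never has to build the "much larger set of data" explicitly. The function names are made up for illustration.

```python
import math


def expand(point):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(p, q):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(p, q) ** 2


p, q = (1.0, 2.0), (3.0, 0.5)
explicit = dot(expand(p), expand(q))  # dot product after expanding to 3-D
implicit = poly_kernel(p, q)          # same quantity, no expansion needed
print(explicit, implicit)
```

Both values agree up to floating-point rounding. With kernels such as RBF the expanded space is infinite-dimensional, so the shortcut is the only practical way to work in it.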

For a second, pretend you're a farmer and you have a problem: you need to set up a fence to protect your cows from packs of wolves. But where do you build your fence? Well, if you're a really data-driven farmer, one way you could do it would be to build a classifier based on the position of the cows and wolves in your pasture. Racehorsing a few different types of classifiers, we see that SVM does a great job at separating your cows from the packs of wolves. I thought these plots also do a nice job of illustrating the benefits of using a non-linear classifier. You can see that the logistic and decision tree models both only make use of straight lines.

Want to recreate the analysis?

Want to create these plots for yourself? You can run the code in your terminal or in an IDE of your choice, but, big surprise, I'd recommend Rodeo. It has a great pop-out plot feature that comes in handy for this type of analysis. It also ships with Python already included for Windows machines. Besides that, it's now lightning fast thanks to the hard work of TakenPilot.

Once you've downloaded Rodeo, you'll need to save the raw cows_and_wolves.txt file from my GitHub. Make sure you've set your working directory to where you saved the file.

Alright, now just copy and paste the code below into Rodeo, and run it, either by line or the entire script. Don't forget, you can pop out your plots tab, move around your windows, or resize them.

Let SVM do the hard work

In the event that the relationship between a dependent variable and independent variable is non-linear, a linear model is not going to be nearly as accurate as SVM. Taking transformations of variables (log(x), x^2) becomes much less important since it's going to be accounted for in the algorithm. If you're still having trouble picturing this, see if you can follow along with this example.

Let's say we have a dataset that consists of green and red points. When plotted with their coordinates, the points make the shape of a red circle with a green outline (and look an awful lot like Bangladesh's flag).

What would happen if somehow we lost 1/3 of our data? What if we couldn't recover it and we wanted to find a way to approximate what that missing 1/3 looked like?

So how do we figure out what the missing 1/3 looks like? One approach might be to build a model using the 2/3 of the data we do have as a training set. But what type of model do we use? Let's try out the following:

Logistic model
Decision Tree
SVM

I trained each model and then used each to make predictions on the missing 1/3 of our data. Let's take a look at what our predicted shapes look like...

Follow along

Here's the code to compare your logistic model, decision tree and SVM.
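The listing below is a minimal sketch rather than the original script. It assumes NumPy and scikit-learn are installed, and it uses randomly generated stand-in points (red inside a circle, green outside) instead of the original dataset, so the accuracies it prints will not match the figures quoted in this post. Plotting is left out to keep it short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in data: points in a square, red (1) inside a circle,
# green (0) outside it.
points = rng.uniform(-1.0, 1.0, size=(3000, 2))
is_red = (points[:, 0] ** 2 + points[:, 1] ** 2 < 0.5).astype(int)

# "Lose" roughly 1/3 of the points and train on the rest.
missing = rng.uniform(size=len(points)) < 1 / 3
X_train, y_train = points[~missing], is_red[~missing]
X_missing, y_missing = points[missing], is_red[missing]

models = {
    "logistic": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(kernel="rbf"),
}

# Fit each model on the data we kept, then score it on the "missing" third.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_missing, y_missing)
    print(f"{name:>13}: {accuracy[name]:.3f}")
```

To see the predicted shapes, you could scatter-plot `X_missing` colored by each model's `predict` output.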

Follow along in Rodeo by copying and running the code above!

Results

From the plots, it's pretty clear that SVM is the winner. But why? Well, if you look at the predicted shapes of the decision tree and GLM (logistic) models, what do you notice? Straight boundaries. Our input model did not include any transformations to account for the non-linear relationship between x, y, and the color. Given a specific set of transformations we definitely could have made GLM and the DT perform better, but why waste time? With no complex transformations or scaling, SVM only misclassified 117/5000 points (98% accuracy, as opposed to 51% for DT and 12% for GLM)! All of the misclassified points were red, hence the slight bulge.
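For the curious, here is roughly what "a specific set of transformations" could look like for the logistic model. This is a sketch on hypothetical circle-shaped data, assuming NumPy and scikit-learn. The `add_squares` helper is made up for illustration and simply appends x^2 and y^2 as extra columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: label 1 inside a circle, 0 outside.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Raw coordinates only: the model can only draw a straight line.
raw_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)


def add_squares(a):
    """Hand-made transformation: append x^2 and y^2 to the x, y columns."""
    return np.hstack([a, a ** 2])


# Same model, but now a straight line in (x, y, x^2, y^2) space can trace
# a circle in the original (x, y) space.
squared_acc = (
    LogisticRegression()
    .fit(add_squares(X_train), y_train)
    .score(add_squares(X_test), y_test)
)
print(f"raw features:     {raw_acc:.3f}")
print(f"squared features: {squared_acc:.3f}")
```

The catch is that you have to know which transformation to apply. Here the circle makes it obvious, but on real data it usually isn't, and that is the work a non-linear kernel saves you.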

When not to use it

So why not use SVM for everything? Well, unfortunately the magic of SVM is also its biggest drawback. The complex data transformations and resulting boundary plane are very difficult to interpret. This is why it's often called a black box. GLM and decision trees are exactly the opposite: it's very easy to understand exactly what they are doing and why, at the expense of performance.
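A quick way to feel the difference is the sketch below, which assumes scikit-learn and uses toy `make_circles` data. A shallow decision tree can be printed as a few readable rules, while the fitted RBF SVM is described by a pile of support vectors and weights.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data, used only for illustration.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A shallow tree prints as a handful of readable if/else threshold rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["x", "y"])
print(rules)

# An RBF SVM is defined by its support vectors and their weights instead.
svm = SVC(kernel="rbf").fit(X, y)
n_support = len(svm.support_vectors_)
print(f"SVM boundary depends on {n_support} support vectors")
```

You can read the tree's rules aloud to a colleague. There is no comparably compact summary of the SVM's boundary.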

More Resources

Want to know more about SVM? Here are a few good resources I've come across:

Beginner: SVM Tutorial. Just the basics with a little bit of spoon feeding, from Zoya Gavrilov at MIT
Beginner: How SVM algorithm works. Video by Thales Sehn Körting
Intermediate: A Gentle Introduction to Support Vector Machines in Biomedicine. Slides from NYU & Vanderbilt
Advanced: Tutorial on Support Vector Machines for Pattern Recognition, from Christopher Burges at Bell Labs

Original. Reposted with permission.
