Support Vector Machines: A Concise Technical Overview
Support Vector Machines remain a popular and timetested classification algorithm. This post provides a highlevel concise technical overview of their functionality.
Classification is concerned with building a model that separates data into distinct classes. This model is built by inputting a set of training data for which the classes are prelabeled in order for the algorithm to learn from. The model is then used by inputting a different dataset for which the classes are withheld, allowing the model to predict their class membership based on what it has learned from the training set. Wellknown classification schemes include decision trees and Support Vector Machines, among a whole host of others. As this type of algorithm requires explicit class labeling, classification is a form of supervised learning.
As mentioned, Support Vector Machines (SVMs) are a particular classification strategy. SMVs work by transforming the training dataset into a higher dimension, which is then inspected for the optimal separation boundary, or boundaries, between classes. In SVMs, these boundaries are referred to as hyperplanes, which are identified by locating support vectors, or the instances that most essentially define classes, and their margins, which are the lines parallel to the hyperplane defined by the shortest distance between a hyperplane and its support vectors. Consequently, SVMs are able to classify both linear and nonlinear data.
The grand idea with SVMs is that, with a high enough number of dimensions, a hyperplane separating a particular class from all others can always be found, thereby delineating dataset member classes. When repeated a sufficient number of times, enough hyperplanes can be generated to separate all classes in ndimensional space. Importantly, SVMs look not just for any separating hyperplane but the maximummargin hyperplane, being that which resides equidistance from respective class support vectors.
Fig.1: MaximumMargin Hyperplane and the Support Vectors.
When data is linearlyseparable, there are many separating lines that could be chosen. Such a hyperplane can be expressed as
where W is a vector of weights, b is a scalar bias, and X are the training data (of the form (x_{1}, x_{2} )). If our bias, b, is thought of as an additional weight, the equation can be expressed as
which can, in turn, be rewritten as a pair of linear inequalities, solving to greater or less than zero, either of which satisfied indicate that a particular point lies above or below the hyperplane, respectively. Finding the maximummargin hyperplane, or the hyperplane that resides equidistance from the support vectors, is done by combining the linear inequalities into a single equation and transforming them into a constrained quadratic optimization problem, using a Lagrangian formulation and solving using KarushKuhnTucker conditions.
Following this transformation, the maximummargin hyperplane can be expressed in the form
where b and α_{i} are learned parameters, n is the number of support vectors, i is a support vector instance, t is the vector of training instances, y_{i} is the class value of a particular training instance of vector t, and a(i) is the vector of support vectors. Once the maximummargin hyperplane is identified and training is complete, only the support vectors are relevant to the model as they define the maximummargin hyperplane; all of the other training instances can be ignored.
When data is not linearlyseparable, the data is first transformed into a higher dimensional space using some function, and this space is then searched for the hyperplane, another quadratic optimization problem. In order to circumvent the additional computational complexity that increased dimensionality brings, all calculations can be performed on the original input data, which is quite likely of lower dimensionality. This higher computational complexity is based on the fact that, in higher dimensionality, the dot product of an instance and each of the support vectors would need to be calculated for each instance classification. A kernel function, which maps an instance to feature space created by a particular function it applies instances to, is used on the original, lowerdimensionality data. A common kernel function is the polynomial kernel, which expressed
computes the dot product of 2 vectors, and raises that result to the power of n.
SVMs are powerful, wellused classification algorithms, which garnered a substantial amount of research attention prior to the deep learning boom we are currently in. Despite the fact that, to the onlooker, they may no longer be bleeding edge algorithms, they certainly have had great success in particular domains, and remain some of the most popular classification algorithms in the toolkits of machine learning practitioners and data scientists.
What has been presented above is a very highlevel overview of SVMs. Support Vector Machines are a complex classifier, more of which can be found by starting here.
Related:
Top Stories Past 30 Days

