Simple and complete tutorial of Support Vector Machines
Table of content
- Understanding how you calculate distance to a point from a plane
- Maths behind how you find support vectors
- How does and why does a kernel work
- How to figure out if your support vector is good at generalizing
- Bias and variance tradeoff in support vector machines
- Imbalanced Classes In SVM
- Loss function in svm
- Assumptions under SVM and stuff to be careful about
SIMPLEST WAY TO IMAGINE SUPPORT VECTOR MACHINES
Take this plotted data for a 2 dimensional data with two different labels.
Now try to imagine a line somewhere in the middle that differentiates both the data sets.
I am pretty sure most of you can picture this line.
Our aim with SVM is to find this exact line. In more technical terms, we want a line or hyper plane so that the distance of the closest points in each group from the line/Hyperplane is the GREATEST.
Using machine learning jargon, we want a line/hyperplane with the maximum margin from the closest points.
Once we have this line, it acts as a classifier, all points to the right are red and all the points to the left are blue.
The same logic applies, if the data has more than 2 features. In those cases we will find the plane/hyperplane with the maximum margin from the closest points of each separate group.
COMPLETE EXPLANATION OF SUPPORT VECTOR MACHINES
In SVM we try to find a hyperplane in any n dimensional space, that divides two classes with the maximum margin. That is why SVM is also called a “maximum margin classifier”.
The word “Margin” is the super important to SVM.
Let’s get started on finding the plane with the biggest margin between the hyperplane and the points in our dataset.
Understanding how you calculate the distance between a point on a hyperplane. (MATHS TAKES EFFORT TO LEARN )
The first piece of maths that we need to understand for this algorithm , is how do we go about finding the distance (or margin) between a plane and a point.
For people who are completely unaware about linear algebra, please watch this video.
My pictures for this part are taken from the youtube course from caltech, CS 156 by Professor Yaser Abu-Mostafa.
For a given hyperplane
In vector form, wT X =b,
The vector normal to the point is given by w.
A simple proof of the equation is:-
If P and Q are two points on the plane with equation then wT P =b and wT Q=b, so
w T (Q – P) = d – d = 0.
This can be only possible if the dot product is equal to zero. So we know that the vector w is orthogonal to any vector PQ between points P and Q of the plane.
So, now that we know the equation of the line perpendicular to the plane
Now let’s calculate the distance between the closest point Xn and the plane.
This will be given by the projection of the vector (Xn-x) on the orthogonal vector w.
This will be given by the dot product of the vector Xn-x with the unit vector perpendicular to the plane, where x is a point on the plane.
Unit vector is basically a vector with magnitude equal to 1 and in the same direction as the original vector.
The distance is dependent on the coordinates of the point , as well as on w.
Let’s add b and subtract b from this equation
We know that wTx+b is just the equation of the plane so it is equal to 0.
To end up with an equation that is easy to maximize , we can scale the values of
WT Xn + b =1, basically we choose the value w that ensures this.
So, to maximize the distance between the plane and the point we just need to maximize.
Solving the optimization problem
We can change this equation around, rather than maximizing we can minimize this quadratic equation. Analytically, we can show that both these equations are the same.
Also we change the constraint on the minimization to ,
We have to change the constraint around because one minimum and absolute of a value are difficult to operate in a minimization problem.
The logic is that for at least one point xn where, yn (wTXn+b) will be equal to 1 , here yn is the final output which can be +1 or -1. This helps to ensure that all properly classified points will have a value for the above equation greater than equal to 1.
To solve this equation we use the Lagrange method
To solve lagrange under inequality constraints we have to use the KKT condition.
Note to the reader:- Now we are stepping on to some heavy mathematics, few big breaths before starting.
Without the constraint we can just differentiate the equation and set the value equal to 0.
But because of the constrained condition we can only optimize to a certain point, where our constraint is also met. This is where lagrange method can help us solve the equation.
With the lagrange formulation we want to bring the constraint inside our optimization equation.
We multiply it with the lagrange coefficient, after this we subtract it from the actual optimization problem, we subtract it as the inequality was greater than equal to 0.
There is detailed mathematics on why this equation is correct, but for the sake of simplicity we will not be covering that part in this blog
This is lagrange formulation of the final optimization condition. In this equation we end up with additional lagrange coefficients for each value in the dataset.
One another bunch of math that I want you to take for granted is that we are actually maximizing the lagrangian formulation with respect to alpha
Now we have a new constraint that alpha is greater than 0.
Because of this added constraint, we can’t just simply set the gradient with respect to alpha and set it equal to 0. As there is a constraint on it.
Now let’s differentiate with respect to w and b and set them equal to 0
For the next step substitute the values of w and b that we get from the equation into the equation.
Once we substitute the values in lagrangian, and we can get this equation just in terms of alpha.
Some simple observation on what happens to the terms when they drop out are:-
- Sigma alphan yn equals to 0
- We get a term of Sigma alpha n by multiplying to -1
- We get the last term by subtraction the ½ wtw with the first terms of the summation.
Now all the equation is in terms of the lagrange coefficient alpha.
We can also transform the equation to minimization problem as it is easier for computer libraries to handle.
But even now we have to take care of the constraint on alpha. We also have to pass this constraints along with the value of output variable y and input variable x to the software library.
We use quadratic programing the get the values of alpha.
Inside the matrix the dimension of X is n by 1 , hence we end with multiplication of bunch of scalars.
Once we find the values of alpha, we observe that most of the values of alpha are simply equal to 0. It is only for some vectors that the value is not equal to 0.
These vectors are known as support vectors as these vectors determine the shape of the hyperplane.
The hyperplane is actually dependent on a few points rather than the whole data set.
From the equation of the quadratic programming you can develop the intuition that you can work in any infinite dimensional space.
All you would need is the dot product of the data in that dimension,
This is where the kernel methods come from also, for more explanation go to this blog.
Generalization in support vector Machines
If there are a lot of support vectors the results of support vector machines, will not generalize properly for out of dataset values.
Bias and variance tradeoff in support vector machines
If your SVM tries to fit outliers, it can have high variance.
Sp to balance the bias and variance tradeoff in SVM we need to go for regularization.
Maybe we do not find the most optimum solution, in order to balance a few outliers.
We regularize SVM, by introducing an error margin in the equation.
En is the extent to which a support vector can violate the margin.
C is the relative importance of the violation term in comparison to the margin of the points
In this case the equation that we have an added term in the optimization problem that is the total violation that is happening in the system.
The only impact of beta in this case is that alpha cannot be greater than C because in that case the differential of En won’t be equal to 0
IMPORTANCE OF C
A very large value of C , signifies that you do not want any data points to violate the margin. In this case you can have a high bias since you can overfit your data
On the other side, if you have a small value of C, you are given permission to your data points to violate the margin. Therefore, you will have a large bias but lower variance.
Assumptions under SVM and stuff to be careful about
SVM is quite tolerant of input data, especially the soft-margin version.
The only consideration in SVM is that you have to calculate the dot product of the input vectors.
Hence, it is generally recommended to work with SVM with small data sets less than 10,000. Otherwise you can end up with performance issues.
The advantages of support vector machines are:
- Effective in high dimensional spaces.
- Still effective in cases where number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
- If the number of features is much greater than the number of samples, avoid over-fitting in choosing Kernel functions and regularization term is crucial.
- SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
SVM in the case of N different Classes
The below explanation is when you have 2 classes of data. However SVM, works for data with more than 2 classes as well.
Assume you have N different classes. One vs all will train one classifier per class in total Nclassifiers. For class ii it will assume ii-labels as positive and the rest as negative. This often leads to imbalanced datasets meaning generic SVM might not work, but still there are some workarounds.
In one vs one you have to train a separate classifier for each different pair of labels. This leads to N(N−1)*N(N−1)2 classifiers. This is much less sensitive to the problems of imbalanced datasets but is much more computationally expensive.
No comments yet.Add your comment