Introduction to SVM:
In machine learning, several supervised algorithms are available for solving regression and classification problems. The Support Vector Machine (SVM) is one of the most popular supervised machine learning algorithms. It can solve both classification and regression problems, although in practice it is used mostly for classification.
SVM often performs well compared with algorithms such as Logistic Regression and Decision Trees because it can work on small and complex datasets. No algorithm gives the highest accuracy on every task, however, so it is worth comparing SVM against the alternatives on your own data. Conceptually, SVM considers the hyperplanes that can separate the classes and chooses the one with the largest margin.
Working of SVM:
As mentioned before, SVM plots the data points in a multidimensional space and then tries to find the optimal hyperplane that separates one class from another.
The number of dimensions is determined by the number of features in the dataset. So if we say the space is n-dimensional, then n denotes the number of feature columns in the dataset. Depending on n, a hyperplane can be a point, a line, a plane or a higher-dimensional flat surface.
A hyperplane is a decision boundary. In high-dimensional spaces it is imaginary in the sense that we cannot visualize it directly. Hyperplanes separate different classes such that the data points that fall on either side of the hyperplane are classified into different classes. A hyperplane always has one dimension fewer than the space it lives in, so its form varies with the dimension of the space.
The different types of hyperplanes that can be possible are:
- If it’s a 1-Dimensional space then the Hyperplane will be a point.
- If it’s a 2-Dimensional space then the Hyperplane will be a line.
- If it’s a 3-Dimensional space then the Hyperplane will be a plane.
For example, let us assume linear 1-D data plotted in a 1D space. In a 1D space the hyperplane is a point that separates the data points.
- As you can see, the hyperplane is a point here because the data lies in a 1D space.
- The distance between the hyperplane and the nearest observations of each class is called the Marginal Distance.
- Among all the hyperplanes that separate the two classes, the one with the largest marginal distance is called the Maximum Margin Hyperplane.
- So, to choose the best hyperplane, the distance between the hyperplane and the nearest observations of the two classes should be maximum.
In case of outliers:
- A hyperplane that must separate every training point (a hard margin) is highly sensitive to outliers: a single outlier can shift it and cause multiple misclassifications.
- Let us consider an example classification problem with two classes, Obese and Non Obese.
- To make the hyperplane robust to outliers, we must allow some misclassification to occur.
- In the figure, an outlier lies close to the Obese data points. We allow that outlier to be misclassified and ignore it when placing the hyperplane.
- The hyperplane then stays at the midpoint between the two main groups, so typical data points are classified correctly. This is called a soft margin hyperplane.
- A test data point that falls on the Obese side of this hyperplane is therefore classified as Obese.
Anything beyond a 3-Dimensional space cannot easily be visualized. Therefore, to explain the working of the algorithm, let us consider 2 features, i.e. a 2-Dimensional space, and look at how SVM does the classification.
Let us consider a binary classification problem with two features x1 and x2. So for two features we plot the data points in a 2-Dimensional plane. Let us assume that it’s a linear dataset.
The red dots are class A and the Blue dots are class B.
Conceptually, the algorithm considers multiple hyperplanes in the space, and since this is a 2D plane, the hyperplane in this case will be a line. The task is to find the best hyperplane that can segregate one class from another. Let us assume there are three hyperplanes (H1, H2, H3) that are able to separate the classes.
H1, H2, H3 are the Hyperplanes
So from the above figure, it’s clear that the algorithm has identified three potential hyperplanes that may be able to separate the classes. In the next step, the algorithm needs to figure out which hyperplane is the best one.
Now the algorithm calculates the distance from each hyperplane to the nearest data points of each class. That distance is called the Marginal Distance.
The hyperplane that has the maximum marginal distance is selected as the optimal hyperplane. The idea behind this is that the boundary should be as far as possible from the points of both classes. So from the above figure, let's say the H2 hyperplane is the Maximum Margin Hyperplane. The points from each class that are closest to the optimal hyperplane are called Support Vectors, hence the name Support Vector Machine.
Now you have a model that can classify the data and predict the outcome. If we pass new, unseen data into the model, it should be able to predict whether the data belongs to Class A or Class B.
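The steps above can be sketched with scikit-learn. This is only an illustration under stated assumptions: the six data points are made up, and a large C is used to approximate a hard margin on data that is cleanly separable. The fitted model exposes the support vectors directly, and the marginal distance can be derived from the weight vector as 1 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes with features x1, x2.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # Class A
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])  # Class B
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Class A, 1 = Class B

# A large C approximates a hard margin (almost no misclassification allowed).
model = SVC(kernel="linear", C=1e3).fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0.
w = model.coef_[0]
b = model.intercept_[0]

# Distance from the hyperplane to the nearest points of each class.
marginal_distance = 1.0 / np.linalg.norm(w)

print("support vectors:\n", model.support_vectors_)
print("marginal distance:", marginal_distance)
print("prediction for (2, 2):", model.predict([[2.0, 2.0]])[0])
```

Only the points listed in `support_vectors_` determine the hyperplane. Moving any other point, as long as it stays outside the margin, leaves the fitted boundary unchanged.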
What if there is an outlier present in the data?
This is a very obvious question that we get. Now SVM works in a different way when this kind of scenario arises. Let us consider a situation where one of the data points from Class A has been misplaced in Class B. So that particular point is considered an outlier.
In this situation, SVM can use a soft margin: it tolerates the misclassification of the outlier and finds the maximum margin hyperplane for the remaining points as usual. How much misclassification is tolerated is controlled by the C hyperparameter, so SVM can be made fairly robust to a few outliers. This robustness has limits, though. With a very large C, or with many outliers, the hyperplane can still be pulled away from its best position.
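A minimal sketch of this outlier scenario, assuming a made-up dataset in which one point labelled Class A sits in the middle of the Class B group. With a moderate C the model gives up on the outlier, so training accuracy drops below 100%, but typical points of both classes are still classified as expected.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: Class A (label 0) near the origin, Class B (label 1)
# around (5, 5), plus one outlier labelled Class A inside the Class B group.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],   # Class A
              [4, 4], [6, 4], [4, 6], [6, 6],   # Class B
              [5, 5]], dtype=float)             # outlier labelled Class A
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0])

# A moderate C gives a soft margin that tolerates the outlier.
soft = SVC(kernel="linear", C=1.0).fit(X, y)

print("training accuracy:", soft.score(X, y))
print("typical Class A point ->", soft.predict([[1.5, 1.5]])[0])
print("typical Class B point ->", soft.predict([[5.5, 5.5]])[0])
```

The outlier lies inside the Class B group, so no straight line can classify every training point correctly. The soft margin accepts that one error instead of distorting the boundary.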
So far we have only looked at linear data, where SVM does classification easily without much processing. But in a real-world scenario, it is rare to get a perfectly linear dataset. Realistically, we generally get a Non-Linear dataset, and here SVM uses a special function called a Kernel Function. Let us see below what this kernel function is.
Kernel Function is a feature of SVM that is used when we have Non-Linear datasets. The kernel function implicitly maps the Non-Linear data into a higher-dimensional space in which the classes become linearly separable. It does this without computing the new coordinates explicitly: the kernel only computes a similarity between pairs of observations as if they had already been mapped, which is known as the kernel trick.
Some of the most widely used Kernel Functions available for SVM are:
- Polynomial Kernel: This kernel uses a polynomial of the dot product of two observations to transform the Non-Linear data into a linearly separable form.
The equation used here is,
(a * b + r)^d
Here, a and b are two observations from the dataset (a * b is their dot product),
r is a coefficient value,
d is the degree of the polynomial.
- Radial Basis Function (RBF) Kernel: This kernel uses a different mathematical expression. It measures how similar two observations are from the distance between them: nearby observations have a strong influence on each other, and distant ones have almost none.
The equation used here is,
e^(-γ(a - b)^2)
Here, γ is the gamma parameter, a positive value that controls how quickly the influence decays with distance,
a and b are two observations from the dataset, and (a - b)^2 is the squared distance between them.
Note: If the data is linear then we use a linear kernel.
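The two kernel equations can be checked numerically. The sketch below uses made-up observations a and b and arbitrary values for r, d and gamma, and compares a hand computation with the `polynomial_kernel` and `rbf_kernel` helpers from scikit-learn. Note that scikit-learn's polynomial kernel also scales the dot product by its own gamma, which is set to 1 here so that it matches (a * b + r)^d.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

# Hypothetical observations and kernel settings, chosen only for illustration.
a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 0.5]])
r, d, gamma = 1.0, 2, 0.5

# Polynomial kernel: a . b = 1*3 + 2*0.5 = 4, so (4 + 1)^2 = 25.
poly_manual = (a @ b.T + r) ** d

# RBF kernel: exp(-gamma * squared distance between a and b).
rbf_manual = np.exp(-gamma * np.sum((a - b) ** 2))

# Same quantities from scikit-learn (coef0 plays the role of r).
poly_sklearn = polynomial_kernel(a, b, degree=d, gamma=1.0, coef0=r)
rbf_sklearn = rbf_kernel(a, b, gamma=gamma)

print("polynomial:", poly_manual[0, 0], poly_sklearn[0, 0])
print("rbf:", rbf_manual, rbf_sklearn[0, 0])
```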
Let’s look at how SVM implements these kernel functions to transform data. Let’s assume we have a Non-Linear dataset as you can clearly see below.
Now we initialize the SVM algorithm along with one of the kernel functions. So the SVM projects the Non-Linear data to a higher dimension plane(Z-axis) and basically converts Non-separable data to Separable data and then identifies the Optimal Hyperplane.
Here’s what the data looks like after using Kernel Function:
As you can see we now have a 3rd dimension(z-axis) after using the kernel. And the data is linearly separated now. Now SVM can find the optimal Hyperplane using the regular method. But now that it’s in 3D space, the hyperplane will be a plane instead of a line.
As you can see, SVM has identified the optimal hyperplane that separates both classes. In the 3D space this hyperplane is a flat plane. When it is projected back onto the original 2D space, the decision boundary takes the form of a circle.
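The effect of the extra dimension can be imitated by hand. The sketch below is an illustration, not what the kernel does internally: it assumes a synthetic dataset of two concentric circles from `make_circles` and adds an explicit third feature z = x1^2 + x2^2. A linear SVM struggles in the original 2D space but separates the lifted 3D data, and an RBF kernel reaches a similar result without the explicit feature.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Assumed synthetic Non-Linear data: an inner circle inside an outer circle.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM in the original 2D space cannot separate the circles.
flat_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# Explicitly add a third dimension: z = x1^2 + x2^2 (squared distance from origin).
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])

# In 3D a flat plane (roughly z = constant) separates the two classes.
lifted_acc = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# The RBF kernel handles the original 2D data without the manual feature.
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X, y).score(X, y)

print("linear, 2D:", flat_acc)
print("linear, lifted 3D:", lifted_acc)
print("rbf, 2D:", rbf_acc)
```

A plane at a constant height z corresponds to x1^2 + x2^2 = constant in the original space, which is the equation of a circle.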
Most machine learning algorithms have parameters that the model learns automatically from the training data, but there are also parameters that can only be set by the user. These are called Hyperparameters. Hyperparameters are set before fitting the model with training data, and the process of searching for the values that give the best model is called Hyperparameter Tuning. We perform tuning to make the model more accurate and robust.
When it comes to SVM we have two important Hyperparameters, C and Gamma.
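The C parameter controls how strongly misclassification of training points is penalized. It adds a penalty for every misclassified point: a large C allows few misclassifications and gives a narrower margin, while a small C tolerates more misclassifications and gives a wider margin.
The Gamma parameter is used with the RBF kernel, i.e. when we have a Non-Linear dataset. It controls how far the influence of a single training point reaches: a small gamma means a far reach and a smoother boundary, while a large gamma means a close reach and a boundary that bends tightly around individual points.
To tune them, we create a dictionary of C and gamma parameters with a set of candidate values for each, and loop over every combination with an outer and an inner for loop. We train one model per combination and select the parameters that give the highest accuracy on validation data.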
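A minimal sketch of this tuning loop, assuming the Iris dataset and an arbitrary grid of candidate values. Accuracy is estimated with 5-fold cross-validation rather than a single split, which is a choice made here and not something the article prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumed example data and candidate values; adjust both for a real problem.
X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

best_score, best_params = 0.0, None
for C in param_grid["C"]:              # outer loop over C
    for gamma in param_grid["gamma"]:  # inner loop over gamma
        model = SVC(kernel="rbf", C=C, gamma=gamma)
        # Mean accuracy over 5 cross-validation folds.
        score = cross_val_score(model, X, y, cv=5).mean()
        if score > best_score:
            best_score, best_params = score, {"C": C, "gamma": gamma}

print("best parameters:", best_params)
print("best cross-validated accuracy:", best_score)
```

scikit-learn also provides `GridSearchCV`, which wraps the same nested search and refits the best model, so the explicit loops are mainly useful for seeing what happens underneath.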
Now that we know how SVM works in the background, let’s look at the Python implementation of SVM with a small example code.
Python Implementation of SVM:
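A minimal sketch of a linear-kernel fit, assuming the Iris dataset restricted to its first two features so that the result could be drawn on a 2D plot. The dataset choice and the variable names `X`, `y` and `svc` are assumptions made for this example.

```python
from sklearn import datasets, svm

# Assumption: Iris data, first two features only (sepal length and width).
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Fit an SVM classifier with a linear kernel and C=1.
svc = svm.SVC(kernel='linear', C=1).fit(X, y)

print("training accuracy:", svc.score(X, y))
print("support vectors per class:", svc.n_support_)
```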
Here’s what the classification looks like after applying SVM using Linear Kernel.
Now let us use the RBF kernel and see what’s the difference.
We just add the following code,
svc = svm.SVC(kernel='rbf', C=1, gamma='scale').fit(X, y)
Advantages of SVM:
- It works really well when the data is clearly separated according to the classes.
- It is effective when the data is in a higher dimension.
- It remains effective even if the number of features is greater than the number of rows.
- It is memory efficient, i.e. the trained model only needs to store the support vectors rather than all of the training data.
- With a soft margin, it can tolerate a few outliers without special handling.
Disadvantages of SVM:
- It doesn't do well on larger datasets with a lot of features and observations, because it takes a lot of time to train.
- It doesn't perform well when the classes overlap, i.e. when there is noise in the data.
- It can be difficult to choose the correct kernel, as each one uses a different mathematical function.
- We have to scale the features, or else performance suffers.
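The last point can be illustrated with a short sketch. It assumes the Wine dataset bundled with scikit-learn, whose features have very different ranges, and compares an RBF SVM on raw features with the same model placed behind a `StandardScaler` in a pipeline.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed example data: 13 features measured on very different scales.
X, y = load_wine(return_X_y=True)

# Same model, with and without standardizing the features first.
unscaled = SVC(kernel="rbf")
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

unscaled_score = cross_val_score(unscaled, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled, X, y, cv=5).mean()

print("without scaling:", unscaled_score)
print("with scaling:", scaled_score)
```

Putting the scaler inside the pipeline means it is fitted on the training folds only, so the cross-validation scores are not inflated by information from the held-out fold.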