Every algorithm has its magic. The growing demand for data has pushed data scientists to learn a wide range of algorithms, and most industries that invest in Machine Learning are keen to explore them. Support Vector Machine (SVM) is one such algorithm. It is often regarded as a black box technique because its fitted parameters are not easy to interpret, which makes it hard to see how it arrives at a decision. It builds on three working principles:

- Maximum margin classifiers
- Support vector classifiers
- Support vector machines

Let us try to understand each principle in depth.

## 1. **Maximum Margin Classifier**

The maximum margin classifier is often treated as interchangeable with the support vector machine, but an SVM has many more parameters. The maximum margin classifier separates the classes with the hyperplane that has the widest separation (margin) between them. Infinitely many hyperplanes can be drawn through a separable set of data, so it is important to choose the ideal one for classification. Now let us understand what a hyperplane is.

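*In n-dimensional space, a hyperplane is a flat subspace of dimension n-1. For example, if the data lives in a 2-dimensional space, the hyperplane is a straight line dividing the data space into two halves.* In its standard form, for an observation $x = (x_1, \dots, x_n)$ and coefficients $\beta_0, \beta_1, \dots, \beta_n$, it is given by the following equation:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n = 0$$

An observation that lies on the hyperplane satisfies the above equation. An observation that falls in the region above or below the hyperplane satisfies one of the following inequalities instead:

$$\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n > 0 \quad \text{or} \quad \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n < 0$$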
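To make the role of the hyperplane concrete, here is a minimal sketch that checks on which side of a line a few points fall. The coefficients and the points are made up for illustration and are not taken from any dataset in this article.

```python
import numpy as np

# Hypothetical hyperplane in 2-D: beta_0 + beta_1 * x1 + beta_2 * x2 = 0
# (illustrative coefficients, chosen only for this example).
beta_0 = -1.0
beta = np.array([1.0, 1.0])


def hyperplane_value(x):
    """Return beta_0 + beta . x for a single observation x."""
    return beta_0 + np.dot(beta, x)


def side_of_hyperplane(x):
    """Return +1 or -1 for the two sides, and 0 if x lies on the hyperplane."""
    return int(np.sign(hyperplane_value(x)))


points = {
    "on the line": np.array([0.5, 0.5]),
    "above": np.array([2.0, 1.0]),
    "below": np.array([-1.0, 0.0]),
}

sides = {name: side_of_hyperplane(x) for name, x in points.items()}
print(sides)
```

The sign of the expression is all a linear classifier needs to assign a class to a new observation.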

The maximum margin classifier fails on non-separable data, where no hyperplane can split the classes perfectly. For such cases, a support vector classifier comes to the rescue.

In the above diagram, infinitely many hyperplanes can separate the data (left). The maximum margin classifier settles on the single hyperplane that divides the data with the widest margin (right). The data points touching the positive and negative margin boundaries are referred to as support vectors.
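The idea can be sketched with scikit-learn on two well-separated synthetic clusters. This is only an approximation under stated assumptions: the data is generated with `make_blobs` (it is not the article's dataset), and a very large `C` is used so that `SVC` behaves almost like a hard-margin classifier.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters (made-up data for illustration).
X, y = make_blobs(n_samples=40, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.5, random_state=0)

# Assumption: a very large C leaves almost no room for margin violations,
# which approximates the maximum margin classifier on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# For a linear kernel the margin width is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
train_accuracy = clf.score(X, y)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
print("training accuracy:", train_accuracy)
```

Only the handful of points returned in `support_vectors_` determine the hyperplane; moving any other point slightly would not change the fit.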

## 2. **Support Vector Classifiers**

This type of classifier can be regarded as an extended version of the maximum margin classifier that also deals with non-separable cases. In real-life data, most observations fall in overlapping classes. This is why we use support vector classifiers.

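Let us consider a tuning parameter C, which here acts as a budget for how many observations may violate the margin. A high value of C tolerates more violations and gives a wider margin and a more robust model. A lower value of C tolerates fewer violations and gives a narrower margin and a more flexible model. Note that many libraries, including scikit-learn's `SVC`, define `C` the other way round, as a penalty on violations, so a large `C` there corresponds to a small budget here. Let us understand with the following diagram.

In the diagram on the left, the higher value of C allows more errors, which are regarded as violations. The diagram on the right shows a lower value of C, which leaves little room for violations and therefore reduces the margin width.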
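The trade-off between margin width and violations can be sketched in scikit-learn. Keep in mind that scikit-learn's `C` is a penalty on violations, so a small `C` in the code corresponds to a tolerant classifier with a wide margin. The overlapping data below is synthetic and the two `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping synthetic classes (made-up data, not the article's dataset).
X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.5, random_state=0)


def fit_linear_svc(penalty):
    """Fit a linear SVC and return (margin width, number of support vectors)."""
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    return margin, int(clf.n_support_.sum())


# Small penalty: violations are cheap, so the margin is wide.
margin_tolerant, n_sv_tolerant = fit_linear_svc(0.01)
# Large penalty: violations are expensive, so the margin is narrow.
margin_strict, n_sv_strict = fit_linear_svc(100.0)

print("C=0.01 -> margin:", margin_tolerant, "support vectors:", n_sv_tolerant)
print("C=100  -> margin:", margin_strict, "support vectors:", n_sv_strict)
```

A wider margin pulls more observations inside it, which is why the tolerant model ends up with more support vectors.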

## 3. **Support Vector Machine**

The support vector machine approach is used when the decision boundary is non-linear and the data cannot be separated by a support vector classifier, whatever the value of the cost parameter.

The diagram illustrates the inseparable classes in a one-dimensional and two-dimensional space.

When non-linear classes are this difficult to separate, we apply a trick called the kernel trick that helps handle the data.

In the above diagram, the data that was inseparable in one dimension became separable once it was transformed into two dimensions by applying a polynomial kernel of the second degree. Now let us see how to handle two-dimensional linearly inseparable data.
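The one-dimensional case can be reproduced with a few lines of NumPy. This is a minimal sketch on made-up data: the inner points belong to one class and the outer points to the other, and the explicit mapping x → (x, x²) stands in for the second-degree polynomial transformation. The helper `best_threshold_accuracy` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Made-up 1-D data: class 1 sits on both sides of class 0.
x = np.linspace(-3, 3, 13)            # -3.0, -2.5, ..., 3.0
y = (np.abs(x) > 1.5).astype(int)


def best_threshold_accuracy(feature, labels):
    """Best accuracy of a single cut on one feature, in either orientation."""
    best = 0.0
    for t in np.unique(feature):
        pred = (feature > t).astype(int)
        acc = max(np.mean(pred == labels), np.mean((1 - pred) == labels))
        best = max(best, acc)
    return best


# In the original space no single cut on x separates the classes.
acc_1d = best_threshold_accuracy(x, y)

# After mapping x -> (x, x^2), a cut on the new coordinate x^2 is a
# straight (horizontal) line in the 2-D plane and separates them perfectly.
acc_2d = best_threshold_accuracy(x ** 2, y)

print("best accuracy in 1-D:", acc_1d)
print("best accuracy after adding x^2:", acc_2d)
```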

For two-dimensional data, the second-degree polynomial kernel works the same way: the data is transformed to a higher dimension, where a linear plane separates the classes.

**Kernel Functions**

The choice of kernel function can also be regarded as a tuning parameter of an SVM model. Kernels remove the need to compute the higher-dimensional vector space explicitly while still handling data that is not linearly separable. Let us discuss two of the widely used kernel functions:

- Polynomial kernel
- Radial Basis Function kernel
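Before looking at each kernel, here is a small sketch of what "removing the computational requirement" means. For the homogeneous second-degree polynomial kernel on 2-D inputs, the kernel value computed in the original space equals the dot product in an explicit 3-D feature space. The feature map `phi` and the two vectors are illustrative choices, not part of any library.

```python
import numpy as np


def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (no constant offset)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly2_kernel(a, b):
    """Degree-2 polynomial kernel computed directly in the original space."""
    return np.dot(a, b) ** 2


a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # transform first, then take the dot product
implicit = poly2_kernel(a, b)       # never leaves the 2-D space

print(explicit, implicit)
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors, which matters when that space is very large or infinite.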

#### 1. **Polynomial Kernel**

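A polynomial function, here of degree 2, is used to separate non-linear data by transforming it into higher dimensions. In its common form, with degree $d$ and a constant offset $c \ge 0$, the kernel is given by the following equation:

$$K(x, x') = (x^\top x' + c)^d$$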

#### 2. **Radial Basis Function kernel**

This kernel function is also known as the Gaussian kernel function. It corresponds to a feature space with an infinite number of dimensions in which the non-linear data can be separated. It depends on a hyperparameter 'γ' (gamma), which is sensitive to the scale of the features and therefore has to be tuned together with the normalization of the data. In its common form, which scikit-learn's `SVC` also uses, a smaller gamma gives a smoother decision boundary with higher bias and lower variance, while a larger gamma gives lower bias and higher variance. It is explained with the help of the following equation:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

Let us understand the impact of gamma with an example:

Under this definition, a large value of gamma gives a narrow, pointed bump around each observation, whereas a small value gives a softer and broader bump in higher dimensions.
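The effect of gamma can also be checked numerically. The sketch below assumes the common definition of the Gaussian kernel, exp(-γ‖a − b‖²), which is the one scikit-learn uses; the two points and the gamma values are arbitrary.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))


center = np.array([0.0, 0.0])
neighbor = np.array([1.0, 0.0])      # one unit away from the center

# Similarity between the two points for increasing gamma.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(center, neighbor, gamma))

k_small_gamma = rbf_kernel(center, neighbor, 0.1)
k_large_gamma = rbf_kernel(center, neighbor, 10.0)
```

With a large gamma the similarity drops to almost zero one unit away, so each observation only influences its immediate neighborhood. With a small gamma the influence reaches much further.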

**Advantages and disadvantages**

Some of the advantages of SVM are:

- It works flexibly with unstructured, structured and semi-structured data.
- Kernel functions ease the complexity of handling almost any data type.
- Overfitting is observed less often than in many other models.

Despite these advantages, it also has certain disadvantages:

- Training time is high on large datasets.
- The impact of the hyperparameters is often hard to interpret.
- Overall interpretation is difficult because of its black box nature.

**Support Vector Machine Applications**

1. **Healthcare Sector**

SVM can be applied in the healthcare sector to predict a patient's condition and the chances of dangerous diseases. It also plays a large role in medicine composition.

In the imaging category, it can easily classify the images produced by lab machines for detecting illness and body conditions.

2. **Banking Sector**

It can be used to predict the nature of fraudulent customers and can also predict the credit risk. During the sanction of a loan, it can predict the eligibility of a customer.

3. **Social Networking Domain**

It can handle large amounts of unstructured data such as text. This is why it is used to classify types of texts in social networking platforms.

**Case Studies**

**SVM in Python**

First we will try and implement an SVM model in Python. We will take a social network dataset which contains features such as age and salary of a person to predict whether they purchased the product or not.

Let us import all the necessary libraries.

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

We are done importing the libraries. Now it is time to import the data and start building the model.

```
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset.head()
```

We do not necessarily need all the columns to make an analysis. So we will take only the age and salary column as the feature values.

```
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
```

Now we will split the data into training and testing. We will take 75% of the data for training and test it on the rest which makes 25%.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
```

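After splitting, we will scale the features so that age and salary are on comparable scales. We will use Standard Scaler in this case, which standardizes each feature to zero mean and unit variance using statistics learned from the training set.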

```
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```
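It is worth seeing what standardization does and does not do. The sketch below uses a made-up, skewed "salary" feature and applies the same formula as the Standard Scaler by hand: the mean becomes 0 and the standard deviation becomes 1, but the shape of the distribution stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up right-skewed feature standing in for a salary column.
salary = rng.exponential(scale=50_000, size=1_000)

# Standardization: subtract the mean and divide by the standard deviation.
scaled = (salary - salary.mean()) / salary.std()


def skewness(v):
    """Sample skewness: the mean of the cubed standardized values."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)


print("mean after scaling:", scaled.mean())
print("std after scaling:", scaled.std())
print("skewness before:", skewness(salary), "after:", skewness(scaled))
```

Scaling matters for SVMs because the kernels work on distances and dot products, so a feature with large raw values would otherwise dominate.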

Now the preprocessing of the data is over. It is time to build the model. We will apply three kernels in this case and evaluate each of them.

```
from sklearn.svm import SVC
```

**Model for linear kernel**

```
classifier_linear = SVC (kernel = 'linear', random_state = 0)
classifier_linear.fit(X_train, y_train)
```

We have built our first model. Now we will predict our test data.

`y_pred_linear = classifier_linear.predict(X_test)`

**Model for polynomial kernel**

```
# Note: scikit-learn's 'poly' kernel uses degree=3 unless degree is set explicitly
classifier_poly = SVC(kernel = 'poly', random_state = 0)
classifier_poly.fit(X_train, y_train)
y_pred_poly = classifier_poly.predict(X_test)
```

**Model for RBF kernel**

```
classifier_rbf = SVC(kernel = 'rbf', random_state = 0)
classifier_rbf.fit(X_train, y_train)
y_pred_rbf = classifier_rbf.predict(X_test)
```

We have made models on three kernels. The performance will be evaluated on the basis of confusion matrix and classification report.

```
from sklearn.metrics import confusion_matrix
```

**Confusion matrix for linear kernel**

```
cm_linear = confusion_matrix(y_test, y_pred_linear)
```

**Confusion matrix for polynomial kernel**

```
cm_poly=confusion_matrix(y_test,y_pred_poly)
```

**Confusion matrix for RBF kernel**

```
cm_rbf=confusion_matrix(y_test,y_pred_rbf)
```

Judging by the confusion matrices, the model with the RBF kernel gave us the best result. But we are not done yet. Let us move on to the classification report to check the same.

**Classification report**

The classification report extends the confusion matrix with per-class precision, recall and F1-score. It tells us how well our model performs, including its overall accuracy.

**Classification report on linear kernel**

```
from sklearn.metrics import classification_report
class_report_linear = classification_report(y_test, y_pred_linear)
```

**Classification report on polynomial kernel**

```
class_report_poly = classification_report (y_test,y_pred_poly)
```

**Classification report on RBF kernel**

```
class_report_rbf= classification_report (y_test,y_pred_rbf)
```

Even in the classification report, the F1-score is higher for the model with the RBF kernel than for the other two. So, the model with the RBF kernel seems to be the winner. Still, the outcome depends on data handling and preprocessing, so it is too early to call it the best kernel in general.
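For readers who do not have the CSV file at hand, the whole comparison can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: `make_moons` stands in for the social network dataset, and a scikit-learn pipeline is used to bundle scaling and model fitting, which is a convenience swapped in here and not the approach used in the steps above.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data with a non-linear boundary (stand-in dataset).
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

results = {}
for kernel in ("linear", "poly", "rbf"):
    # The scaler is fitted on the training data only, inside the pipeline.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=0))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[kernel] = {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
    }

for kernel, res in results.items():
    print(f"--- {kernel} kernel ---")
    print(res["confusion_matrix"])
    print(res["report"])
```

Which kernel wins on this synthetic data says nothing about the social network dataset; the point is only the workflow of fitting, predicting and comparing the three kernels side by side.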

**SVM in R**

We will use the same dataset in order to compare how Python and R differ in the results they produce. First we will import the dataset and encode the target variable as a factor.

```
dataset = read.csv('Social_Network_Ads.csv')
```

We will take the same features which we took to build a model in Python.

```
dataset = dataset [3:5]
```

In R, it is important to convert the target feature to a factor before building a classification model.

```
# Encoding the target feature as factor
dataset$Purchased = factor (dataset$Purchased, levels = c (0, 1))
```

Now let us perform the same splitting as we performed in Python. We will set a seed in order to fetch the same split when you try it. So basically you will get the same result if you maintain the same seed.

```
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset (dataset, split == TRUE)
test_set = subset (dataset, split == FALSE)
```

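Now we will scale the feature data to zero mean and unit variance so that age and salary are on comparable scales. Note that the code below scales the test set with its own mean and standard deviation, whereas the Python version reused the statistics of the training set.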

```
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
```

Data preprocessing is a very important step before model building and we have completed it. Now we will build the model with three kernels: linear, polynomial and RBF.

```
install.packages('e1071')
library(e1071)
library(kernlab)
library(caret)
```

The above three libraries are loaded in order to carry out our model building with three kernels. Please make sure they are installed and loaded before model building.

**Model for linear kernel**

```
linear_classifier = svm(formula = Purchased ~ .,
data = training_set,
type = 'C-classification',
kernel = 'linear')
```

We have built the model. Now it is time to predict the test labels.

```
y_pred_linear = predict (linear_classifier, newdata = test_set[-3])
```

**Model for polynomial kernel**

```
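# kernlab's ksvm takes the formula as its first argument and uses
# type = 'C-svc' for classification; polydot's default degree is 1
poly_classifier = ksvm(Purchased ~ .,
                       data = training_set,
                       type = 'C-svc',
                       kernel = 'polydot')
y_pred_poly = predict(poly_classifier, newdata = test_set[-3])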
```

**Model for RBF kernel**

```
# Same kernlab conventions as above: formula first, type = 'C-svc'
rbf_classifier = ksvm(Purchased ~ .,
                      data = training_set,
                      type = 'C-svc',
                      kernel = 'rbfdot')
y_pred_rbf = predict(rbf_classifier, newdata = test_set[-3])
```

Now the model building is done on all three kernels. It is time to evaluate the models. In R, we will evaluate based on the confusion matrix only, without reproducing the classification report that we used in Python. A confusion matrix will be enough for now to make an analysis.

**Confusion matrix for linear kernel**

```
cm_linear = table (test_set[, 3], y_pred_linear)
```

**Confusion matrix for polynomial kernel**

```
cm_poly = table(test_set[, 3], y_pred_poly)
```

**Confusion matrix for RBF kernel**

```
cm_rbf = table (test_set[, 3], y_pred_rbf)
```

We have evaluated our models based on the confusion matrix, and the models in R did not perform as well as the ones in Python. Out of the three, the model with the RBF kernel again outperformed the other two. But its rate of misclassification is still high compared to the model built in Python. One reason can be the randomness involved in splitting, since the two languages do not produce the same train and test sets. To compare both languages fairly, we can apply stratified sampling in the same way in both cases.
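On the Python side, stratified sampling is a single argument to `train_test_split`. The sketch below uses made-up imbalanced labels (300 zeros and 100 ones) to show that the stratified split keeps the class share in the test set fixed, while the plain split lets it drift with the random seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 75% class 0, 25% class 1.
y = np.array([0] * 300 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature

# Plain random split: the class share in the test set depends on the seed.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Stratified split: the class share matches the full dataset.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

share_plain = y_test_plain.mean()
share_strat = y_test_strat.mean()

print("share of class 1, plain split:", share_plain)
print("share of class 1, stratified split:", share_strat)
```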
