Introduction to Random Forest Algorithm
In data analytics, every algorithm comes with trade-offs. Most business problems, however, boil down to a classification task, and it is often hard to know intuitively which technique to adopt given the nature of the data. Today we will discuss one of the most trusted classifiers among data practitioners: the Random Forest classifier. Random Forest can also be used for regression, which we will cover here as well.
The word ‘Forest’ suggests that the model contains a lot of trees. The algorithm combines a collection of decision trees to make a classification, and it is also a remedy for the overfitting of a single decision tree. A decision tree has high variance and low bias, which can give quite unstable output, unlike the commonly adopted logistic regression, which has high bias and low variance. This is where Random Forest comes to the rescue. But before discussing Random Forest in detail, let’s take a quick look at the tree concept.
“A decision tree is a classification as well as a regression technique. It works great when it comes to taking decisions on data by creating branches from a root, which are essentially the conditions present in the data, and providing an output known as a leaf.”
For more details, we have a comprehensive article on Decision Trees for you to read.
Moreover, a random forest samples both the observations and the variables of the training data when developing individual decision trees, and then takes the majority vote for classification or the average for regression. Plain bagging, by contrast, samples observations at random but uses all columns, so the same significant variables tend to end up at the root of every tree and the trees stay correlated. By restricting the columns available to each tree, a random forest builds trees that are largely independent of each other, at the cost of some accuracy in each individual tree. There is a rule of thumb for choosing the sub-samples. If each tree is trained on about 2/3 of the observations and p is the number of columns, then
- For classification, we take sqrt(p) number of columns
- For regression, we take p/3 number of columns.
The above rule of thumb can be tuned if you want to increase the accuracy of the model.
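As a small illustration of the rule of thumb above, the sketch below computes how many columns would be tried for a given p. The helper name `default_feature_count` is hypothetical and exists only for this example; in scikit-learn, the corresponding setting is the `max_features` parameter of the random forest estimators.

```python
import math


def default_feature_count(p: int, task: str) -> int:
    """Rule-of-thumb number of columns to sample (hypothetical helper)."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))   # sqrt(p) columns
    if task == "regression":
        return max(1, p // 3)              # p/3 columns, rounded down
    raise ValueError("task must be 'classification' or 'regression'")


p = 16  # made-up number of columns
n_class = default_feature_count(p, "classification")
n_reg = default_feature_count(p, "regression")
print(n_class, n_reg)  # 4 5
```

Treat these values as starting points only; they are usually tuned together with the number of trees.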
Let us illustrate both the bagging and the random forest technique by drawing two samples, one in blue and another in pink.
From the above diagram, we can see that the Bagging technique has selected a few observations but all columns. On the other hand, Random Forest selected a few observations and a few columns to create uncorrelated individual trees.
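The difference between the two sampling schemes can be sketched with NumPy on made-up data. This is a simplification under stated assumptions: the data shape is invented, and the column subset is drawn once per tree, whereas implementations such as scikit-learn draw a fresh subset of columns at every split.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 12, 9                      # made-up data shape
data = rng.normal(size=(n_rows, n_cols))

# Bagging: bootstrap the rows (sampling with replacement), keep every column.
bag_rows = rng.integers(0, n_rows, size=n_rows)
bag_sample = data[bag_rows, :]

# Random forest style: bootstrap the rows AND keep only a random subset of columns.
n_sub = int(np.sqrt(n_cols))                # sqrt(p) columns, as for classification
rf_rows = rng.integers(0, n_rows, size=n_rows)
rf_cols = rng.choice(n_cols, size=n_sub, replace=False)
rf_sample = data[np.ix_(rf_rows, rf_cols)]

print(bag_sample.shape)  # (12, 9)
print(rf_sample.shape)   # (12, 3)
```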
A sample idea of a random forest classifier is given below
The above diagram gives us an idea of how each tree is grown and how the depth of the trees varies with the sample selected. In the end, voting is performed for the final classification. For a regression problem, averaging is performed instead.
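The final combination step can be sketched in a few lines. The per-tree predictions below are invented for illustration; they are not produced by a fitted model.

```python
import numpy as np

# Hypothetical class predictions: 5 trees (rows) for 4 observations (columns).
tree_votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
])
# Classification: take the most frequent class per observation.
majority = np.array([np.bincount(col).argmax() for col in tree_votes.T])
print(majority)  # [1 0 1 0]

# Hypothetical numeric predictions: 3 trees (rows) for 2 observations (columns).
tree_values = np.array([
    [100.0, 200.0],
    [110.0, 190.0],
    [90.0, 210.0],
])
# Regression: average the trees' outputs per observation.
average = tree_values.mean(axis=0)
print(average)  # [100. 200.]
```

Note that some libraries combine trees slightly differently. scikit-learn's classifier, for instance, averages the class probabilities of the trees rather than counting hard votes.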
Classifier Vs. Regressor
A random forest classifier works with data having discrete labels, better known as classes.
Example: whether a patient is suffering from cancer or not, whether a person is eligible for a loan or not, etc.
A random forest regressor works with data having a numeric or continuous output that cannot be divided into classes.
Example: the price of houses, milk production of cows, the gross income of companies, etc.
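To make the distinction concrete, here is a minimal sketch that fits both estimators from scikit-learn on the same synthetic features. The data and the target definitions are invented for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                  # synthetic features

y_class = (X[:, 0] + X[:, 1] > 10).astype(int)         # discrete label: 0 or 1
y_value = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)   # continuous target

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y_value)

new_point = np.array([[7.0, 6.0]])
label = clf.predict(new_point)   # one of the known classes
value = reg.predict(new_point)   # a real number
print(label, value)
```

The classifier can only return a label it saw during training, while the regressor returns a number on a continuous scale.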
Advantages and Disadvantages of Random Forest
- It reduces overfitting in decision trees and helps to improve the accuracy
- It is flexible to both classification and regression problems
- It works well with both categorical and continuous values
- It can cope with missing values in the data, although how this is done depends on the implementation
- Normalising of data is not required as it uses a rule-based approach.
However, despite these advantages, a random forest algorithm also has some drawbacks.
- It requires considerable computational power and resources, as it builds numerous trees and combines their outputs.
- It also takes longer to train, as it combines many decision trees to determine the class.
- Because it is an ensemble of decision trees, it is harder to interpret than a single tree, and the contribution of each variable to a prediction is not obvious.
Applications of Random Forest
Banking analysis requires a lot of effort, as it carries a high risk of profit and loss. Customer analysis is one of the most common studies in the banking sector. For problems such as estimating a customer's chance of loan default or detecting fraudulent transactions, random forest can be a great choice.
The above representation is a tree which decides whether a customer is eligible for loan credit based on conditions such as account balance, duration of credit, payment status, etc.
In pharmaceutical industries, random forest can be used to identify the potential of a certain medicine or the composition of chemicals required for medicines. It can also be used in hospitals to identify the diseases suffered by a patient, risk of cancer in a patient, and many other diseases where early analysis and research play a crucial role.
Applying Random Forest with Python and R
We will perform case studies in Python and R for both Random forest regression and Classification techniques.
Random Forest Regression in Python
For regression, we will work with data containing the salaries of employees based on their position. We will use it to predict the salary of an employee from their position.
Let us take care of the libraries and the data:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('Salaries.csv')
df.head()
# Feature: the second column, kept 2-D for scikit-learn. Target: the third column.
X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values
As the dataset is very small we won’t perform any splitting. We will proceed directly to fitting the data.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)
Did you notice that we have made just 10 trees by putting n_estimators=10? It is up to you to play around with the number of trees. As it is a small dataset, 10 trees are enough.
Now we will predict the salary of a person who has a level of 6.5
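The prediction call itself can be sketched as follows. So that the snippet runs on its own, it refits a model on stand-in data: the levels and salaries below are made up for illustration and are not the article's dataset, so the number it prints will not necessarily match the figure quoted in this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data (assumption): levels 1..10 with invented salaries.
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000]) * 1000.0

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)

# predict expects a 2-D input: one row per observation, one column per feature.
y_pred = model.predict([[6.5]])
print(y_pred)
```

Passing `[[6.5]]` rather than `6.5` matters here, because scikit-learn estimators expect a two-dimensional array of features.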
After prediction, we can see that the model estimates a salary of 167000 for an employee at level 6.5. Let us visualise the result to interpret it better.
# Build a fine grid of levels so the step-like prediction curve is visible.
X_grid_data = np.arange(X.min(), X.max(), 0.01)
X_grid_data = X_grid_data.reshape((len(X_grid_data), 1))
plt.scatter(X, y, color='red')
plt.plot(X_grid_data, model.predict(X_grid_data), color='blue')
plt.title('Random Forest Regression')
plt.xlabel('Position')
plt.ylabel('Salary')
plt.show()
Random Forest Regression in R
Now we will build the same model in R and see how it affects the prediction.
We will first import the dataset:
df = read.csv('Position_Salaries.csv')
df = df[2:3]
In R too, we won’t perform splitting as the data is too small. We will use the entire data for training and make an individual prediction as we did in Python
We will use the ‘randomForest’ library. If you have not installed the package yet, the code below will help you out.
install.packages('randomForest')
library(randomForest)
set.seed(1234)
Setting the seed makes the random sampling reproducible, so you get the same result each time you run the code.
model = randomForest(x = df[-2], y = df$Salary, ntree = 500)
Now we will predict the salary of a level 6.5 employee and see how much it differs from the one predicted using Python.
y_prediction = predict(model, data.frame(Level = 6.5))
As we see, the prediction gives a salary of 160908, whereas in Python we got 167000. Some difference is expected, since the two models use a different number of trees (500 here versus 10 in Python) and different random seeds. It is up to the data analyst to decide which model works better. We are done with the prediction. Now it’s time to visualise the data.
install.packages('ggplot2')
library(ggplot2)
x_grid_data = seq(min(df$Level), max(df$Level), 0.01)
ggplot() +
  geom_point(aes(x = df$Level, y = df$Salary), colour = 'red') +
  geom_line(aes(x = x_grid_data, y = predict(model, newdata = data.frame(Level = x_grid_data))), colour = 'blue') +
  ggtitle('Truth or Bluff (Random Forest Regression)') +
  xlab('Level') +
  ylab('Salary')
So that is regression using R. Now let us quickly move to the classification part to see how Random Forest works.
Random Forest Classifier in Python
For classification, we will use the Social Networking Ads data, which records whether a person purchased a product along with their age and salary. Let us import the libraries:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Now let us see the dataset:
df = pd.read_csv('Social_Network_Ads.csv')
df
For your information, the dataset contains 400 rows and 5 columns.
# Features: columns 2 and 3 (age and salary). Target: column 4 (purchased or not).
X = df.iloc[:, [2, 3]].values
y = df.iloc[:, 4].values
Now we will split the data for training and testing. We will take 75% for training and the rest for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
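Now we will standardise the data using StandardScaler from the sklearn library. Tree-based models do not strictly need this step, but it puts both features on the same scale, which keeps the decision-boundary plots later on manageable.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean and standard deviation from the training data
X_test = sc.transform(X_test)        # reuse the training statistics on the test data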
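To see why `fit_transform` is called on the training data and only `transform` on the test data, consider this small sketch with invented age and salary values. The scaler learns the mean and standard deviation from the training rows and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, salary) rows for illustration only.
train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
test = np.array([[30.0, 40000.0], [50.0, 80000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean and std from train only
test_scaled = sc.transform(test)        # applies the training mean and std

print(sc.mean_)        # [   30. 40000.]
print(test_scaled[0])  # [0. 0.]
```

The first test row equals the training mean, so it maps to zero in both columns. Had the test set been scaled with its own statistics, the same person would get different scaled values in training and testing.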
After scaling, let us see the head of the data now.
Now it’s time to fit our model.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
model.fit(X_train, y_train)
We have built 10 trees and set the criterion to ‘entropy’, so each split is chosen to reduce the entropy (impurity) of the resulting nodes as much as possible. You can increase the number of trees if you wish, but we are keeping it at 10 for now.
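As a rough illustration of what the entropy criterion measures, the sketch below computes the entropy of a node and the information gain of a split by hand. The labels and the `entropy` helper are made up for this example; scikit-learn performs the equivalent computation internally when choosing splits.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # evenly mixed node
left, right = parent[:4], parent[4:]          # a split that separates the classes

# Entropy after the split, weighted by the size of each child node.
weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child

print(entropy(parent))  # 1.0
print(info_gain)        # 1.0
```

An evenly mixed two-class node has an entropy of 1 bit, and a split that separates the classes perfectly removes all of it, which is the largest possible gain here.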
Now the fitting is over. We will predict the test data.
y_prediction = model.predict(X_test)
After prediction, we can evaluate with a confusion matrix and see how well our model performs.
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, y_prediction)
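For readers new to confusion matrices, here is a small self-contained sketch of how to read one. The true and predicted labels are invented and are unrelated to the article's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for ten observations.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)
# Rows are the true classes, columns the predicted classes.
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()

print(cm)        # [[5 1]
                 #  [1 3]]
print(accuracy)  # 0.8
```

The diagonal holds the correct predictions and the off-diagonal cells hold the misclassifications, so the misclassification rate is simply one minus the accuracy.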
Great. As we see, our model is doing well, as the rate of misclassification is very low. Now let us visualise our training result.
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train
# Grid covering the scaled feature space, used to colour the decision regions.
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2,
             model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the observations, one colour per class.
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Now let us visualise the test result in the same way.
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test
# Grid covering the scaled feature space, used to colour the decision regions.
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2,
             model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the observations, one colour per class.
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
So that’s it for Python. Next we will build the same model in R.
Random Forest Classifier in R
Let us import the dataset and check the head of the data
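df = read.csv('Social_Network_Ads.csv')
# Keep only columns 3 to 5: Age, EstimatedSalary and Purchased.
df = df[3:5]
Now in R, we need to convert the class column to a factor, so that randomForest treats the task as classification rather than regression.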
df$Purchased = factor(df$Purchased, levels = c(0, 1))
Now we will split the data and see the result. The splitting ratio will be the same as we did in Python.
install.packages('caTools')
library(caTools)
set.seed(123)
split_data = sample.split(df$Purchased, SplitRatio = 0.75)
training_set = subset(df, split_data == TRUE)
test_set = subset(df, split_data == FALSE)
Also, we will perform the standardisation of the data and see how it performs while testing.
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
Now we fit the model using the ‘randomForest’ package for R.
install.packages('randomForest')
library(randomForest)
set.seed(123)
model = randomForest(x = training_set[-3], y = training_set$Purchased, ntree = 10)
We set the number of trees to 10 to see how it performs. You can try a larger number of trees to improve accuracy.
y_prediction = predict(model, newdata = test_set[-3])
Now the prediction is over and we will evaluate using a confusion matrix.
conf_mat = table(test_set[, 3], y_prediction)
conf_mat
As we see, the model underperforms compared to the Python one, as its rate of misclassification is higher.
Now let us interpret our result using visualisation. The code below loads the ElemStatLearn package for a smooth visualisation.
library(ElemStatLearn)
train_set = training_set
# Grid covering the scaled feature space, used to colour the decision regions.
X1 = seq(min(train_set[, 1]) - 1, max(train_set[, 1]) + 1, by = 0.01)
X2 = seq(min(train_set[, 2]) - 1, max(train_set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(model, grid_set)
plot(train_set[, -3], main = 'Random Forest Classification (Training set)', xlab = 'Age', ylab = 'Estimated Salary', xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(train_set, pch = 21, bg = ifelse(train_set[, 3] == 1, 'green4', 'red3'))
The model works fine, as is evident from the visualisation of the training data. Now let us see how it performs with the test data.
library(ElemStatLearn)
testset = test_set
# Grid covering the scaled feature space, used to colour the decision regions.
X1 = seq(min(testset[, 1]) - 1, max(testset[, 1]) + 1, by = 0.01)
X2 = seq(min(testset[, 2]) - 1, max(testset[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(model, grid_set)
plot(testset[, -3], main = 'Random Forest Classification (Test set)', xlab = 'Age', ylab = 'Estimated Salary', xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(testset, pch = 21, bg = ifelse(testset[, 3] == 1, 'green4', 'red3'))
That’s it for now. The model worked fine on the test data, as expected.
Random Forest works well when we want to avoid the overfitting that comes from building a single decision tree. It also works fine when the data mostly contain categorical variables. Other algorithms such as logistic regression can outperform it on numeric variables, but when it comes to making a decision based on conditions, random forest is a strong choice. It is up to the analyst to play around with the parameters to improve accuracy. There is often less chance of overfitting, because the forest averages many largely independent trees. But yet again, it depends on the data and the analyst to choose the best algorithm.