Random Forest Algorithm in Machine Learning

The Random Forest algorithm is one of the most popular and best-performing machine learning algorithms available today. Random forests are an ensemble learning technique that works by constructing multiple decision trees with diverse samples, thereby helping to build a more accurate and robust model.

Random Forest is applicable to both classification and regression problems, which makes it suitable for a wide range of supervised learning tasks.

In this guide, we will discuss how the Random Forest algorithm works, its advantages, and its applications. We will also explore how to tune the algorithm for better results.

What is the Random Forest Algorithm?

Random Forest is an ensemble method for both classification and regression. It works by building several decision trees during training and returning the class that receives the most votes among them for classification, or the mean of their predictions for regression.

The “Random” part of random forests refers to how the algorithm introduces randomness into the model-building process: each tree is trained on a random sample of the data and considers random subsets of the features. Aggregating the predictions of these varied trees helps minimize overfitting and enhances prediction accuracy.

We can think of the Random Forest algorithm as a forest of decision trees, where each tree is trained on a subset of the data and features. The intuition behind this approach, known as bagging (bootstrap aggregating), is that by combining the predictions from multiple trees, the model as a whole will be more stable, accurate, and less prone to errors.

How Does the Random Forest Algorithm Work?

Random Forest is an ensemble learning method used primarily for classification and regression tasks. It works by constructing multiple decision trees during training and outputting the mode (for classification) or mean (for regression) of the predictions of individual trees. Here’s how it works step by step:

1. Bootstrapping (Random Sampling with Replacement)

In Random Forest, each tree is trained on a different subset of the original dataset. This is done by bootstrapping — randomly selecting samples from the dataset with replacement. Some data points may be repeated in the sample, while others might be left out. This technique increases the diversity of trees in the forest.

Example:

Suppose you have a dataset with 100 observations. For the first tree, you randomly select 100 observations (with replacement). This may result in some observations appearing more than once, while others may be excluded from the training data of that tree. Repeat this process for multiple trees.
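
The sampling step in this example can be sketched in a few lines of NumPy. This is only an illustration under assumed names (`bootstrap_idx`, `in_bag`, and `out_of_bag` are hypothetical, and the seed is arbitrary); it shows that drawing 100 rows with replacement leaves some rows duplicated and others unused.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_samples = 100                 # matches the 100-observation example

# Draw 100 row indices with replacement: this is one bootstrap sample.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows drawn at least once ("in-bag") and rows never drawn ("out-of-bag").
in_bag = np.unique(bootstrap_idx)
out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)

print("distinct rows in the sample:", len(in_bag))
print("rows left out of the sample:", len(out_of_bag))
```

For large datasets, a bootstrap sample contains roughly 63% of the distinct rows on average; the left-out rows can be used as a built-in validation set for that tree.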

2. Random Feature Selection

When building each decision tree, Random Forest does not evaluate every feature at every node. Instead, at each split it draws a random subset of the features and chooses the best feature from that subset to split the data. This reduces the correlation between trees, which makes the combined model more dependable.

Example:

The dataset contains five features (A, B, C, D, E). At one split, Random Forest might draw features A, B, and C and pick the best split among them. At another split, or in another tree, it might draw B, D, and E instead, which keeps the decision trees from becoming highly correlated with each other.
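
A minimal sketch of this draw, assuming a subset size of three (the variable names and the seed are hypothetical): each call picks a fresh set of candidate features, and only those candidates would be searched for the best split.

```python
import numpy as np

rng = np.random.default_rng(1)
features = np.array(["A", "B", "C", "D", "E"])
max_features = 3  # assumed subset size for this illustration

# Each split draws its own candidate set, without repeats inside one draw.
candidates_split_1 = rng.choice(features, size=max_features, replace=False)
candidates_split_2 = rng.choice(features, size=max_features, replace=False)

print("candidates at split 1:", candidates_split_1)
print("candidates at split 2:", candidates_split_2)
```

A common choice for classification is to set the subset size to the square root of the total number of features.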

3. Building Multiple Decision Trees

Using its bootstrapped dataset and a random subset of features at each split, a decision tree is built. This tree is grown until it reaches the maximum depth (or other stopping criteria, such as a minimum number of samples in a node). Each tree is constructed independently of the others.

Example:

In a product purchase classification problem, each decision tree in the forest ends up with a different structure, because each one is built from its own training sample and its own random feature choices.
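
To see what the stopping criteria do, the sketch below grows one unconstrained tree and one constrained tree on the Iris data with scikit-learn. The particular limits (`max_depth=2`, `min_samples_leaf=5`) are assumptions chosen only to make the difference visible.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Without limits, the tree keeps splitting until leaves are pure
# or cannot be split any further.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The same tree with two stopping criteria applied.
small_tree = DecisionTreeClassifier(
    max_depth=2,         # stop after two levels of splits
    min_samples_leaf=5,  # every leaf must keep at least 5 samples
    random_state=0,
).fit(X, y)

print("unconstrained: depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("constrained:   depth", small_tree.get_depth(), "leaves", small_tree.get_n_leaves())
```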

4. Making Predictions

For classification, each tree in the forest makes a prediction, and the final output is the majority vote from all trees. For regression, the final prediction is the average of all the trees’ predictions.

Example for Classification:

Imagine a Random Forest with 3 trees, each making the following prediction on whether a customer will purchase a product:

Tree 1: No
Tree 2: Yes
Tree 3: Yes

The majority vote is “Yes,” so the Random Forest classifies the customer as likely to purchase the product.

Example for Regression:

Suppose the Random Forest is predicting the price of a house, and the 3 trees predict:

Tree 1: $250,000
Tree 2: $270,000
Tree 3: $260,000

The final prediction would be the average: ($250,000 + $270,000 + $260,000) / 3 = $260,000.
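
Both aggregation rules can be reproduced with the Python standard library. The sketch uses the three tree outputs from the examples in this step; the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Classification: majority vote over the three tree outputs.
tree_votes = ["No", "Yes", "Yes"]
majority_class = Counter(tree_votes).most_common(1)[0][0]

# Regression: plain average of the three predicted prices.
tree_prices = [250_000, 270_000, 260_000]
average_price = mean(tree_prices)

print(majority_class)  # Yes
print(average_price)   # 260000
```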

Key Advantages of Random Forest:

Reduces Overfitting: By averaging predictions (for regression) or using majority voting (for classification), it reduces variance and prevents overfitting, a common issue in single decision trees.
Handles Missing Data: Random Forest can cope with missing data in a dataset, although how this is done depends on the implementation.
Works with Large Datasets: It is capable of handling large datasets efficiently and can also manage data with high-dimensionality.
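
One way to check the variance-reduction claim on your own data is to cross-validate a single decision tree and a forest side by side. The sketch below does this on scikit-learn's built-in breast cancer dataset, which is an assumption made for illustration; results will differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model on the same data.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print("single tree  : mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("random forest: mean %.3f, std %.3f" % (forest_scores.mean(), forest_scores.std()))
```

Comparing the mean and the spread across folds gives a rough picture of how much the ensemble stabilises the predictions.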

Summary:

Bootstrap samples of the data are used to train individual decision trees.
Random feature selection ensures each tree uses a different subset of features to make splits.
Multiple trees are created and then aggregated (via voting or averaging) to make the final prediction.
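
The three summary points can be put together in a small hand-rolled forest. This is a teaching sketch, not how a library implements it: it assumes 25 trees, uses scikit-learn's `DecisionTreeClassifier` for the individual trees, and combines them with a hard majority vote (scikit-learn's own `RandomForestClassifier` averages class probabilities across trees instead).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25  # assumed forest size for the sketch
trees = []

for i in range(n_trees):
    # Bootstrap sample of the training rows (drawn with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Grow a tree that considers a random feature subset at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Collect every tree's prediction and take the majority vote per test row.
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
forest_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])

print("hand-rolled forest accuracy:", (forest_pred == y_test).mean())
```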

Example in Python (using sklearn):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the Random Forest model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
# Make predictions
y_pred = rf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Advantages of the Random Forest Algorithm

There are several advantages to using the Random Forest algorithm, which have contributed to its popularity among machine learning practitioners:

Robustness and Accuracy:

High accuracy, especially with complex datasets.
Reduces variance and overfitting by utilizing multiple decision trees, resulting in more reliable predictions.

Versatility:

Can be used for both classification and regression tasks.
Widely used across domains like finance, healthcare, and marketing.

Handling Missing Data:

Can cope with missing values in a dataset, although the exact mechanism (such as imputation) depends on the implementation.

Feature Importance:

Identifies the importance of features by analyzing their contribution to tree decision-making.

Resistance to Overfitting:

Less prone to overfitting compared to individual decision trees, especially with noisy data.

Parallelizable:

Trees can be trained independently and in parallel, speeding up the model-building process, particularly with large datasets.
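
Two of these advantages, feature importance and parallel training, are directly visible in scikit-learn. The sketch below assumes the Iris dataset and 100 trees; `n_jobs=-1` asks the library to use all available CPU cores, and `feature_importances_` exposes the impurity-based importance scores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()

# n_jobs=-1 builds the trees in parallel on all available CPU cores.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(data.data, data.target)

# Impurity-based importances, normalised so that they sum to 1.
ranked = sorted(
    zip(data.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```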

Applications of the Random Forest Algorithm

With its versatility and stability, Random Forest proves to be the first choice for many applications. Some of the more common use cases are as follows:

Medical Diagnosis:

Used for diagnosing diseases like cancer, diabetes, and heart disease.
Learns from historical medical data to predict outcomes and suggest appropriate doctors.

Financial Sector:

Applied in credit scoring, fraud detection, and risk assessment.
Analyzes transaction and customer behavior data to identify anomalies.

E-commerce and Marketing:

Used for customer segmentation, product recommendations, and sales forecasting.
Helps marketers understand customer needs for better product targeting.

Agriculture:

Predicts crop yields, detects crop diseases, and optimizes irrigation schedules.
Analyzes environmental and agricultural data for improved resource management.

Natural Language Processing (NLP):

Identifies patterns and classifies text data.
Extensively used in text mining and text processing applications.

Random Forest Classifier: A Special Focus on Classification Tasks

The Random Forest Classifier is a variant of the Random Forest algorithm, designed explicitly for classification tasks. In such a classification problem, the goal is to predict the class or category of an observation based on a given feature set.

During training, the Random Forest Classifier builds a collection of decision trees, each of which assigns the most likely class label to a new observation.

The process is essentially the same as the general Random Forest algorithm; the major variation here is that, instead of predicting a continuous value (as in regression), the model predicts a discrete class. The final output is the class with the most “votes” among the trees.

The random forest classifier is well-suited for datasets with high dimensions and large feature counts. It is also fairly robust to noise, which gives it an advantage in the presence of outliers or missing values, or when the dataset is small.
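
In scikit-learn, the aggregated result is available both as class probabilities and as a final label. The sketch below assumes the Iris dataset and 50 trees; in that library, `predict` returns the class whose probability, averaged over the trees, is highest.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

sample = X[:5]
proba = clf.predict_proba(sample)  # one row per observation, one column per class
labels = clf.predict(sample)       # class with the highest averaged probability

print(np.round(proba, 2))
print(labels)
```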

Hyperparameter Tuning and Optimization

While Random Forest is a robust algorithm, its performance can be further improved through hyperparameter tuning and optimization. Some important hyperparameters to consider when training a Random Forest model include:

n_estimators: The number of trees in the forest. More trees generally improve performance, but at the cost of computation time, so model performance has to be balanced against compute efficiency.
max_depth: The maximum depth of each tree. Deeper trees may capture more complex patterns in the data, but they also tend to overfit. This parameter is tuned to control the complexity of the model.
min_samples_split: The minimum number of samples required to split an internal node. Higher values produce simpler trees that generalize better and lower the risk of overfitting, while values that are too high can cause underfitting.
max_features: The number of features to consider when searching for the best split. Smaller values make the trees less correlated and help avoid overfitting, but too few features may decrease model accuracy, so it is worth trying several values.
bootstrap: Whether bootstrapping is used when building the trees. Bootstrapping lets each tree be trained on a different random sample of the dataset, increasing diversity and reducing overfitting.

Searching over these hyperparameters with Grid Search or Random Search usually yields a better-tuned, and consequently better-performing, Random Forest model.
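
A grid search over the hyperparameters listed above might look like the following sketch. The grid values are hypothetical and deliberately small so that the search finishes quickly; the Iris dataset is used only as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical, deliberately small grid for illustration.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

When the grid grows large, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead of trying all of them.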

Applying Random Forest with Python and R

We will conduct case studies in Python and R, utilizing both random forest regression and classification techniques.

For regression, we will be working with data that contains employees’ salaries based on their positions. We will use this to predict an employee’s salary based on their position.

Let us take care of the libraries and the data:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('Salaries.csv')
df.head()

X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values

As the dataset is very small, we won’t be able to split it. We will proceed directly to fitting the data.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 10, random_state = 0)
model.fit(X, y)

Did you notice that we have made just 10 trees by putting n_estimators=10? It is up to you to play around with the number of trees. As it is a small dataset, 10 trees are enough.

Now, we will predict the salary of a person who has a level of 6.5.

y_pred = model.predict([[6.5]])

After the prediction, we can see that the model estimates a salary of 167000 for an employee at level 6.5. Let us visualise and interpret it better.

# Build a fine grid over the observed levels to draw the prediction curve
X_grid_data = np.arange(X.min(), X.max(), 0.01)
X_grid_data = X_grid_data.reshape((len(X_grid_data), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid_data, model.predict(X_grid_data), color = 'blue')
plt.title('Random Forest Regression')
plt.xlabel('Position')
plt.ylabel('Salary')
plt.show()

Random Forest Regression in R

Now, we will build the same model in R and see how the prediction compares.

We will first import the dataset:

df = read.csv('Position_Salaries.csv')
df = df[2:3]

In R, we won’t perform splitting as the data is too small. We will use the entire data for training and make an individual prediction as we did in Python.

We will use the ‘randomForest’ package. If you have not installed it yet, the code below will do so.

install.packages('randomForest')
library(randomForest)
set.seed(1234)

Setting the seed makes the results reproducible, so you get the same result each time you run the code.

model= randomForest(x = df[-2],
                         y = df$Salary,
                         ntree = 500)

Now, we will predict the salary of a level 6.5 employee and compare it to the one predicted using Python.

y_prediction = predict(model, data.frame(Level = 6.5))

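As we can see, the prediction yields a salary of 160908, but in Python, we obtain a prediction of 167000. Both results come from the same algorithm; the gap reflects the different number of trees (500 here versus 10 in Python) and the different random seeds, so it is up to the data analyst to decide which configuration works better. We are done with the prediction. Now it’s time to visualize the data.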

install.packages('ggplot2')
library(ggplot2)
x_grid_data = seq(min(df$Level), max(df$Level), 0.01)
ggplot() +
  geom_point(aes(x = df$Level, y = df$Salary), colour = 'red') +
  geom_line(aes(x = x_grid_data, y = predict(model, newdata = data.frame(Level = x_grid_data))), colour = 'blue') +
  ggtitle('Truth or Bluff (Random Forest Regression)') +
  xlab('Level') +
  ylab('Salary')

So this is for regression using R. Now let us quickly move to the classification part to see how Random Forest works.

Random Forest Classifier in Python

For classification, we will utilize Social Networking Ads data, which contains information about products purchased based on a person’s age and salary. Let us import the libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now let us see the dataset:

df = pd.read_csv('Social_Network_Ads.csv')
df

For your information, the dataset contains 400 rows and 5 columns.

X = df.iloc[:, [2, 3]].values
y = df.iloc[:, 4].values

Now, we will split the data into training and testing sets. We will allocate 75% for training and the remaining 25% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

Now, we will standardise the data using StandardScaler from the sklearn library.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

After scaling, let’s examine the head of the data now.

Now, it’s time to fit our model.

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
model.fit(X_train, y_train)

We have built 10 trees and used the ‘entropy’ criterion, which measures the impurity of a node so that each split is chosen to reduce it. You can increase the number of trees if you wish, but we are keeping it limited to 10 for now.

Now the fitting is over. We will predict the test data.

y_prediction = model.predict(X_test)

After prediction, we can evaluate by the confusion matrix and see how well our model performs.

from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_prediction)

Great. As we can see, our model is performing well, with a significantly low rate of misclassification, which is interesting. Now, let’s visualize our training results.

from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1,X2,model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Now let us visualize test results in the same way.

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1,X2,model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),alpha=0.75,cmap= ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

That’s it for Python. We will now build the same model in R.

Random Forest Classifier in R

Let us import the dataset and keep only the columns we need.

df = read.csv('SocialNetwork_Ads.csv')
df = df[3:5]

In R, the target column must be a factor for classification, so we encode it accordingly.

df$Purchased = factor(df$Purchased, levels = c(0, 1))

Now, we will split the data and see the result. The splitting ratio will be the same as we did in Python.

install.packages('caTools')
library(caTools)
set.seed(123)
split_data = sample.split(df$Purchased, SplitRatio = 0.75)
training_set = subset(df, split_data == TRUE)
test_set = subset(df, split_data == FALSE)

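Additionally, we will standardize the features, reusing the training set’s mean and standard deviation for the test set as we did in Python.

scaled_train = scale(training_set[-3])
training_set[-3] = scaled_train
test_set[-3] = scale(test_set[-3],
                     center = attr(scaled_train, 'scaled:center'),
                     scale = attr(scaled_train, 'scaled:scale'))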

Now, we fit the model using the ‘randomForest’ package.

install.packages('randomForest')
library(randomForest)
set.seed(123)
model= randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 10)

We set the number of trees to 10 to see how it performs. We can set any number of trees to improve accuracy.

y_prediction = predict(model, newdata = test_set[-3])

Now the prediction is over, and we will evaluate using a confusion matrix.

conf_mat = table(test_set[, 3], y_prediction)
conf_mat

As we see, the model underperforms compared to Python as the rate of misclassification is high.

Now, let us interpret our result using visualization. We will be loading the ElemStatLearn package for smooth visualization.

library(ElemStatLearn)
train_set = training_set
X1 = seq(min(train_set [, 1]) - 1, max(train_set [, 1]) + 1, by = 0.01)
X2 = seq(min(train_set [, 2]) - 1, max(train_set [, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(model, grid_set)
plot(train_set[, -3],
     main = 'Random Forest Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(train_set, pch = 21, bg = ifelse(train_set [, 3] == 1, 'green4', 'red3'))

The model works fine, as is evident from the visualisation of training data. Now, let us see how it performs with the test data.

library(ElemStatLearn)
testset = test_set
X1 = seq(min(testset [, 1]) - 1, max(testset [, 1]) + 1, by = 0.01)
X2 = seq(min(testset [, 2]) - 1, max(testset [, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(model, grid_set)
plot(testset[, -3], main = 'Random Forest Classification (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(testset, pch = 21, bg = ifelse(testset [, 3] == 1, 'green4', 'red3'))

That’s it for now. The test data just worked fine, as expected.

Random Forest Regression in Python

For this second regression example, we will use a synthetic single-feature dataset generated with make_regression, split it into training and testing sets, and measure the error on the test set.

Let us take care of the libraries and the data:

# Import necessary libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Generating a simple regression dataset

X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)

# Splitting the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializing the Random Forest Regressor

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Fitting the model to the training data

rf_regressor.fit(X_train, y_train)

# Predicting the results on the test set

y_pred = rf_regressor.predict(X_test)

# Calculating the Mean Squared Error of the predictions

mse = mean_squared_error(y_test, y_pred)

# Printing the Mean Squared Error

print(f"Mean Squared Error: {mse}")

# Plotting the results

plt.scatter(X_test, y_test, color='blue', label='Actual data')
plt.scatter(X_test, y_pred, color='red', label='Predicted data')
plt.title("Random Forest Regression - Actual vs Predicted")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Challenges and Limitations of the Random Forest Algorithm

Despite its advantages, Random Forest has challenges and limitations to consider:

Interpretability:

Random Forest is considered a black-box model, making it hard to interpret feature contributions to predictions.
Lack of transparency can be problematic when model explainability is crucial.

Computational Cost:

It can be computationally expensive, especially with large datasets and many trees.
Requires significant computational power, particularly for real-time applications.

Memory Usage:

Storing multiple decision trees increases memory usage.
Memory-constrained environments may struggle to hold large forests.

Sensitivity to Class Imbalance:

Sensitive to imbalanced classes, often biased toward the majority class.
Resampling or using class weights can help mitigate this issue.
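
The class-weight mitigation can be tried with scikit-learn's `class_weight="balanced"` option. The sketch below uses synthetic data with an assumed 95/5 class split, generated purely for illustration; whether weighting actually improves minority-class recall depends on the dataset, so the two numbers should be compared rather than assumed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 95/5 class split, for illustration only.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

plain = RandomForestClassifier(n_estimators=100, random_state=0)
weighted = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

for name, model in [("no weights", plain), ("class_weight='balanced'", weighted)]:
    model.fit(X_train, y_train)
    # Recall on the minority class (label 1) is the figure most affected by imbalance.
    minority_recall = recall_score(y_test, model.predict(X_test), pos_label=1)
    print(name, "minority recall:", round(minority_recall, 3))
```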

Conclusion

Random Forest is a widely used ensemble machine learning approach. It offers high accuracy, robustness, and resilience to overfitting by averaging the predictions of multiple decision trees. The Random Forest algorithm is widely used for both classification and regression tasks due to its multiple advantages, including simplicity of use, handling missing data, and feature importance analysis.

Although Random Forest is strong and popular among data scientists, it is not without its challenges, such as interpretability and computational complexity. However, with careful tuning and optimization, it can become one of the best algorithms for solving complex problems in machine learning.

A variant of this algorithm, the Random Forest Classifier, is especially powerful for classification problems, where it can handle many dimensions and data with numerous noisy variables. Random Forest is one of the most important and widely used machine learning algorithms, and it remains one of the most trusted and favourite tools in the toolbox of machine learning practitioners worldwide.
