Understanding the Ensemble Methods Bagging and Boosting

Bias and Variance
Ensemble Methods
Bagging
Boosting
Bagging vs Boosting
Implementation

Bias and Variance

For the same set of data, different algorithms behave differently. For example, if we want to predict house prices from a given dataset, we could use Linear Regression or a Decision Tree Regressor. Each of these algorithms interprets the dataset in its own way and therefore makes different predictions. One of the key distinctions between them is how much bias and variance they produce.

There are 3 types of prediction error: bias, variance, and irreducible error. Irreducible error, also known as “noise,” can’t be reduced by the choice of algorithm. The other two types of errors, however, can be reduced because they stem from your algorithm choice.
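For squared-error loss, the three components can be written down explicitly. The notation here is an assumption made for illustration: the observed target is y = f(x) + ε with noise variance σ², and f̂ is the model learned from a randomly drawn training set, so the expectation runs over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the first two terms depend on the learning algorithm, which is why the noise term cannot be reduced by choosing a different model.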

“… bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). ”
Wikipedia

Bias comes from the simplifying assumptions a model makes so that the target function is easier to learn. Models with high bias are less flexible and are not fully able to learn from the training data.

In the given figure, we use a linear model, such as linear regression, to learn from the data. As we can see, the regression line fails to fit the majority of the data points and thus, this model has high bias and low learning power. Generally, models with low bias are preferred.

“… variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting). ”
Wikipedia

Variance describes how much the predictions of a model change when it is trained on one dataset rather than another. Equivalently, it is the amount by which the estimate of the target function will change if different training data is used.

In the given figure, we can see that a non-linear model such as SVR (Support Vector Regressor) tries to generate a polynomial function that passes through all the data points. This may seem like the perfect model, but such models do not generalize well and perform poorly on data that has not been seen before. Ideally, we want a model with low variance.

However, bias and variance pull in opposite directions: making a model more flexible to decrease its bias usually increases its variance, and vice versa. This is known as the bias-variance tradeoff.
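The tradeoff can be observed empirically. The following is a minimal sketch, assuming a synthetic sine-shaped target, a noise level of 0.3, and 200 resampled training sets (all of these are our own choices for illustration). It refits a rigid model and a flexible model many times and estimates the squared bias and the variance of each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)


def true_f(x):
    # Assumed ground-truth function for this illustration.
    return np.sin(2 * np.pi * x)


x_test = np.linspace(0, 1, 50).reshape(-1, 1)


def bias_variance(make_model, n_rounds=200, n_train=40, noise=0.3):
    """Estimate squared bias and variance by refitting on fresh training sets."""
    preds = np.empty((n_rounds, len(x_test)))
    for r in range(n_rounds):
        x = rng.uniform(0, 1, size=(n_train, 1))
        y = true_f(x).ravel() + rng.normal(0, noise, size=n_train)
        preds[r] = make_model().fit(x, y).predict(x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_test).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


lin_bias, lin_var = bias_variance(LinearRegression)        # rigid model
tree_bias, tree_var = bias_variance(DecisionTreeRegressor)  # flexible model

print(f"linear regression: bias^2={lin_bias:.3f}, variance={lin_var:.3f}")
print(f"decision tree:     bias^2={tree_bias:.3f}, variance={tree_var:.3f}")
```

Under these assumptions, the straight line cannot follow the sine curve and shows the larger bias, while the fully grown tree chases the noise in each training set and shows the larger variance.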

Ensemble Methods

The general principle of an ensemble method in Machine Learning is to combine the predictions of several models, built with a given learning algorithm, in order to improve robustness over a single model. Ensemble methods can be divided into two groups:

Parallel ensemble methods: In these methods, the base learners are generated in parallel, independently of each other. For example, when deciding which movie to watch, you may ask multiple friends for suggestions and then watch the movie that got the most votes.
Sequential ensemble methods: In this technique, the learners are trained one after another, with early learners fitting simple models to the data. The data is then analyzed for errors, and each new learner tries to correct the errors left by the previous ones. The overall performance can be boosted by giving previously mislabeled examples a higher weight.

Most ensemble methods use a single base learning algorithm to produce homogeneous base learners, i.e. learners of the same type, leading to homogeneous ensembles. Examples are Random Forest (a parallel ensemble method) and AdaBoost (a sequential ensemble method).

Some methods use heterogeneous learners, i.e. learners of different types. This leads to heterogeneous ensembles. The voting classifier in Scikit-learn is an example of an ensemble of heterogeneous learners. For an ensemble to be more accurate than any of its members, the base learners have to be as accurate and as diverse as possible.
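A heterogeneous ensemble can be sketched with Scikit-learn's VotingClassifier. The three base learners, their settings, and the use of the iris dataset below are our own assumptions, chosen only to show learners of different types voting together.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Three learners of different types; "hard" voting takes the majority class.
voter = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)

val_accuracy = voter.score(X_val, y_val)
print("validation accuracy of the voting ensemble:", val_accuracy)
```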

Bagging

Bagging (short for Bootstrap Aggregating) is a parallel ensemble method that decreases the variance of a prediction model by generating additional training sets. Each set is produced by random sampling with replacement from the original set, so some observations may be repeated in a new training set while others are left out. In Bagging, every element has the same probability of appearing in a new dataset. Enlarging the training data in this way does not improve the model's predictive force; rather, it decreases the variance and narrowly tunes the prediction to the expected outcome.
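Sampling with replacement can be tried out in a few lines. This sketch assumes a toy dataset of 10,000 row indices (the size and seed are arbitrary). When n rows are drawn with replacement from n rows, each row has a probability of 1 - (1 - 1/n)^n, roughly 63%, of appearing at least once, so every bootstrap sample leaves part of the data out.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Draw n row indices with replacement: one bootstrap sample.
indices = rng.integers(0, n, size=n)

# Fraction of the original rows that made it into the sample.
unique_fraction = np.unique(indices).size / n
print("fraction of distinct rows in the bootstrap sample:", unique_fraction)
```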

These multisets of data are used to train multiple models, so we end up with an ensemble of different models. Their predictions are then combined, which is more robust than relying on a single model. For regression, the prediction is the average of the predictions given by the different models. For classification, the majority vote is taken.
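The procedure can be written out by hand. The following is a minimal sketch, assuming 25 decision trees and the iris dataset (both are our own choices): each tree is trained on its own bootstrap sample, and the final class is decided by majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_models = 25
models = []
for i in range(n_models):
    # Bootstrap sample: same size as the training set, drawn with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=i)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# One row of predictions per model, shape (n_models, n_validation_samples).
all_votes = np.array([m.predict(X_val) for m in models])

# Majority vote over the models for each validation sample.
bagged_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_votes.T]
)
bagged_accuracy = np.mean(bagged_pred == y_val)
print("validation accuracy of the bagged trees:", bagged_accuracy)
```

For a regression task, the vote would be replaced by the mean of the predictions across models.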

For example, decision tree models tend to have a high variance, so bagging is often applied to them. Usually, the Random Forest model is used for this purpose. It is an extension of bagging: in addition to bootstrapping the data, it takes a random selection of features, rather than all features, to grow the trees. An ensemble of many such random trees is called a Random Forest.
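The difference between plain bagging and Random Forest can be made visible in Scikit-learn. This is a sketch under our own assumptions (iris data, 50 trees, max_features="sqrt"). It inspects how many features a single fitted tree is allowed to consider at each split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 4 features

# Plain bagging: bootstrap samples, but every split may use all features.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
).fit(X, y)

# Random Forest: bootstrap samples plus a random subset of features per split.
forest = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0
).fit(X, y)

bagging_features_per_split = bagging.estimators_[0].max_features_
forest_features_per_split = forest.estimators_[0].max_features_
print("bagged tree considers", bagging_features_per_split, "features per split")
print("forest tree considers", forest_features_per_split, "features per split")
```

Restricting the features makes the individual trees less similar to each other, which is what lets averaging them reduce variance further.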

Boosting

Boosting is a sequential ensemble method that in general decreases the bias error and builds strong predictive models. The term ‘Boosting’ refers to a family of algorithms that convert weak learners into a strong learner.

Boosting trains multiple learners one after another. The data samples are weighted, so some of them may take part in the new training sets more often than others.

In each iteration, data points that are mispredicted are identified and their weights are increased so that the next learner pays extra attention to get them right. The following figure illustrates the boosting process.

During training, the algorithm also allocates a weight to each resulting model. A learner with good prediction results on the training data will be assigned a higher weight than a poor one. So when evaluating a new learner, Boosting also needs to keep track of each learner’s errors.
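A single boosting round can be worked through by hand. The sketch below assumes the classic binary AdaBoost update with labels in {-1, +1} and a made-up set of five predictions containing one mistake. Other boosting variants use different formulas.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the learner gets index 1 wrong
weights = np.full(5, 0.2)              # start with equal sample weights

# Weighted error of this learner: total weight of the mispredicted points.
error = weights[y_pred != y_true].sum()

# Learner weight: the smaller the error, the larger the say in the final vote.
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weight of mispredicted points, lower the others, then renormalize.
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print("error:", error)
print("learner weight alpha:", alpha)
print("updated sample weights:", new_weights)
```

After the update, the single mispredicted point carries half of the total weight, so the next learner is pushed to get it right.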

Some of the Boosting techniques include an extra condition to keep or discard a single learner. For example, in AdaBoost an error of less than 50% is required to maintain the model; otherwise, the iteration is repeated until achieving a learner better than a random guess.
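Putting these pieces together, here is a minimal from-scratch sketch of binary AdaBoost with decision stumps. It makes several assumptions of its own: labels are in {-1, +1}, the data is synthetic, 20 rounds are used, and the loop simply stops when a stump is no better than a random guess instead of retrying. Scikit-learn's AdaBoostClassifier differs in its details, so this should be read as an illustration rather than as its implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
y = 2 * y01 - 1  # convert labels from {0, 1} to {-1, +1}

n_rounds = 20
weights = np.full(len(y), 1 / len(y))
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    error = weights[pred != y].sum()
    # Keep the learner only if it beats a random guess (simplification: stop).
    if error >= 0.5 or error == 0:
        break
    alpha = 0.5 * np.log((1 - error) / error)
    # Mispredicted points gain weight, correctly predicted ones lose weight.
    weights = weights * np.exp(-alpha * y * pred)
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)


def predict(X_new):
    # Weighted vote of all kept stumps.
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(scores)


stump_accuracy = np.mean(stumps[0].predict(X) == y)
boosted_accuracy = np.mean(predict(X) == y)
print("training accuracy of the first stump:", stump_accuracy)
print("training accuracy of the boosted ensemble:", boosted_accuracy)
```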

Bagging vs Boosting

There’s no outright winner; it depends on the data, the simulation, and the circumstances. Both Bagging and Boosting decrease the variance of a single estimate, as they combine several estimates from different models. As a result, the performance of the model increases, and the predictions are much more robust and stable.

But how do we measure the performance of a model? One way is to compare its training accuracy with its validation accuracy, which is done by splitting the data into two sets: a training set and a validation set.

The model is trained on the training set and evaluated on the validation set. Thus, the training accuracy is evaluated on the training set and gives us a measure of how well the model can fit the training data. On the other hand, validation accuracy is evaluated on the validation set and reveals the generalization ability of the model. A model’s ability to generalize is crucial to its success. Thus, we can say that the performance of a model is good if it can fit the training data well and also predict the unknown data points accurately.
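As a small illustration, the two accuracies can be computed as follows. The dataset, the split ratio, and the use of a fully grown decision tree are our own assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out a third of the data as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)  # how well the data is fitted
val_accuracy = tree.score(X_val, y_val)        # how well the model generalizes
print("training accuracy:  ", train_accuracy)
print("validation accuracy:", val_accuracy)
```

A large gap between the two numbers points to overfitting, while two low numbers point to underfitting.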

If a single model performs poorly, Bagging will rarely give a better bias. Boosting, however, can generate a combined model with lower errors, as it optimizes the advantages and reduces the pitfalls of the single model. On the other hand, Bagging can increase the generalization ability of the model and help it better predict unknown samples. Let us see an example of this in the next section.

Implementation

In this section, we demonstrate the effect of Bagging and Boosting on the decision boundary of a classifier. Let us start by introducing some of the algorithms used in this code.

Decision Tree Classifier: Decision Tree Classifier is a simple and widely used classification technique. It applies a straightforward idea to solve the classification problem. Decision Tree Classifier poses a series of carefully crafted questions about the attributes of the test record. Each time it receives an answer, a follow-up question is asked until a conclusion about the class label of the record is reached.
Decision Stump: A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a prediction based on the value of just a single input feature. Here we take decision stump as a weak learner for the AdaBoost algorithm.
RandomForest: Random forest is an ensemble learning algorithm that uses the concept of Bagging.
AdaBoost: AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm that works on the principle of Boosting. We use a Decision stump as a weak learner here.
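To make the idea of a decision stump concrete, the following sketch fits a one-level tree on two iris features and prints the single question it asks. The feature names passed to export_text are labels we chose for readability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data[:, [0, 2]], iris.target  # sepal length and petal length

# max_depth=1 gives one internal node (the root) and two leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

print(export_text(stump, feature_names=["sepal length", "petal length"]))

n_leaves = stump.get_n_leaves()
n_classes_predicted = len(set(stump.predict(X)))
print("leaves:", n_leaves, "| distinct classes predicted:", n_classes_predicted)
```

With only two leaves, the stump can never predict more than two of the three iris classes, which is why it makes a convenient weak learner for AdaBoost.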

Here is a piece of code written in Python which shows:

How Bagging decreases the variance of a Decision tree classifier and increases its validation accuracy
How Boosting decreases the bias of a Decision stump and increases its training accuracy

import itertools

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions  # to plot decision boundaries
from mlxtend.preprocessing import shuffle_arrays_unison
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Loading some example data
iris = datasets.load_iris()
X, y = iris.data[:, [0,2]], iris.target
X, y = shuffle_arrays_unison(arrays=[X, y], random_seed=3)
# Split the data into a training set (first 100 rows) and a validation set
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# define arrangement of subplots
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))

labels = ['DecisionTreeClassifier', 'Random Forest', 'Decision Stump', 'AdaBoost']
classifiers = [
    DecisionTreeClassifier(),  # fully grown tree
    RandomForestClassifier(n_estimators=100, max_features=2,
                           max_leaf_nodes=3, min_samples_split=5),
    DecisionTreeClassifier(max_depth=1),  # decision stump
    AdaBoostClassifier(DecisionTreeClassifier(max_depth=1)),  # boosted stumps
]
# Fit each classifier and draw its decision regions in a 2 x 2 grid
for clf, lab, grd in zip(classifiers, labels,
                         itertools.product([0, 1], repeat=2)):

    clf.fit(X_train, y_train)
    print(lab, 'training accuracy:', clf.score(X_train, y_train))
    print(lab, 'validation accuracy:', clf.score(X_test, y_test))
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)

plt.show()

Output:

Let’s elaborate on these results:

The Decision Tree classifier fits the training data perfectly well but does not score as well in validation accuracy.
Random Forest fits the training data decently and outperforms the decision tree in validation accuracy, which implies that Random Forest generalizes better.
The Decision stump performed very poorly in both training and validation accuracy, which means it is not able to fit the training data well.
AdaBoost increases the prediction power of the stump and increases the training as well as the validation accuracy.

From the above plot, we see that the RandomForest algorithm softens the decision boundary and hence decreases the variance of the decision tree model, whereas AdaBoost fits the training data in a better way and hence decreases the bias of the decision stump.

This brings us to the end of this article. We have learned about Bagging and Boosting techniques to increase the performance of a Machine learning model.
