
Overfitting and Underfitting in Machine Learning

When you get into the depths of Data Science, you realize that there aren’t any truly complex ideas or programs, just a collection of simple building blocks. For example, a neural network may seem like a complex model, but in reality, it is only a combination of numerous smaller ideas. Rather than trying to learn everything at once, it is better to start from the basics of model building and work your way up to more complex problems.

This article walks you through two fundamentals of model building: Overfitting and Underfitting. These are not models themselves but two failure modes that determine how accurate a machine learning model can be.

Introduction

Whenever we work on a dataset to predict or classify a problem, we first check accuracy by applying the model to the training set and then to the test set. If the accuracy is not satisfactory, we adjust the model, for example by adding or removing features or by applying feature engineering, to improve the quality of its predictions. But sometimes our model still gives poor results. Why does this happen? And how do we deal with it? All of this can be explained with a brief study of the concepts of overfitting and underfitting.

Why do our Models Sometimes Perform Poorly?

Let’s take a deep dive into the classic illustration of underfitting, a good fit, and overfitting, usually drawn as three graphs side by side:

When you look at the graph on the left side, you can easily see that the line does not cover all the points shown in the graph. Such a model tends to cause the phenomenon known as underfitting of data. It is also called High Bias.

Conversely, when you take a look at the graph on the right side, it shows that the predicted line covers all the points in the graph. In such a situation, you might think this is a good graph that covers all the points, but that’s not true. The predicted line covers all points, including those which are noise and outliers. Such a model is also responsible for poor predictions due to its high complexity. This model is also known as the High Variance model.

Now, take a look at the middle graph; it shows a well-fitted line. It covers a major portion of the points in the graph while also maintaining the balance between bias and variance.

In machine learning, we want our model to predict and classify data in a generalized way. So, to solve the problems of overfitting and underfitting, we have to make our model generalize. Statistically speaking, generalization describes how well our model fits the data such that it also gives accurate results on unseen examples.

Model Basics

In order to talk about underfitting vs overfitting, we need to start with the basics: what is a model? A model is simply a system for mapping inputs to outputs. For example, if we want to predict house prices, we could make a model that takes the square footage of a house and gives price as the output. A model represents a theory about a problem: there is some connection between the square footage and the price, and we make the model learn that relationship. Models are useful because we can use them to predict the values of outputs for new data points, given the inputs.

A model learns relationships between the inputs, called features, and outputs, called labels, from a training dataset. During training, the model is given both the features and labels and learns how to map the former to the latter. A trained model is evaluated on a testing set, where we only give the features and it makes predictions. We compare the predictions with the known labels for the testing set to calculate accuracy. Models can take various shapes, from simple linear regressions to deep neural networks, but all supervised models are based on the fundamental idea of learning relationships between inputs and outputs from training data.
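To make this concrete, here is a minimal sketch of such a model in scikit-learn; the square-footage values, prices, and variable names are illustrative assumptions, not data from the article:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features: square footage; labels: house prices (made-up numbers for illustration)
square_feet = np.array([[800], [1000], [1200], [1500], [1800], [2200]])
prices = np.array([120_000, 150_000, 175_000, 210_000, 250_000, 300_000])

# Learn the relationship on a training set, evaluate on a held-out testing set
X_train, X_test, y_train, y_test = train_test_split(
    square_feet, prices, test_size=0.3, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)              # map features (inputs) to labels (outputs)
print(model.predict(X_test))             # predicted prices for unseen houses
print(model.score(X_test, y_test))       # R^2 on the test set as a simple accuracy measure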

Training and Testing Data

To make a model, we first need data that has an underlying relationship. For this example, we will create our own simple dataset with x-values (features) and y-values (labels). An important part of our data generation is adding random noise to the labels. In any real-world process, whether natural or man-made, the data does not exactly fit a trend. There is noise, and other variables in the relationship that we can’t measure. In the house price example, the trend between area and price is linear, but the prices do not lie exactly on a line because of other factors influencing house prices.

Similarly, our data has a trend (which we call the true function) and random noise to make it more realistic. After creating the data, we split it into random training and testing sets. The model will attempt to learn the relationship on the training data and be evaluated on the test data. In this case, 70% of the data is used for training and 30% for testing. The following graph shows the data we will explore.

We can see that our data are distributed with some variation around the true function (a partial sine wave) because of the random noise we added. During training, we want our model to learn the true function without being “distracted” by the noise.
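A minimal sketch of this data-generation step in plain NumPy; the sample count, noise level, and variable names are illustrative assumptions, not values from the article:

import numpy as np

rng = np.random.default_rng(42)

n_samples = 120
x = rng.uniform(0, 1.5 * np.pi, size=n_samples)              # features
true_function = np.sin(x)                                    # the underlying trend: a partial sine wave
y = true_function + rng.normal(scale=0.2, size=n_samples)    # labels = trend + random noise

# Random 70% / 30% train / test split
indices = rng.permutation(n_samples)
n_train = int(0.7 * n_samples)
x_train, y_train = x[indices[:n_train]], y[indices[:n_train]]
x_test, y_test = x[indices[n_train:]], y[indices[n_train:]]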

Example:

# Training loop using MXNet Gluon and the d2l utility library
from mxnet import gluon, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()


def train(train_features, test_features, train_labels, test_labels,
          num_epochs=1000):
    loss = gluon.loss.L2Loss()
    net = nn.Sequential()
    # Switch off the bias since we already catered for it in the polynomial
    # features
    net.add(nn.Dense(1, use_bias=False))
    net.initialize()
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    test_iter = d2l.load_array((test_features, test_labels), batch_size,
                               is_train=False)
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.01})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    for epoch in range(1, num_epochs + 1):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch % 50 == 0:
            # Log the training and test loss every 50 epochs
            animator.add(epoch, (d2l.evaluate_loss(net, train_iter, loss),
                                 d2l.evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data().asnumpy())
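This train function mirrors the polynomial-regression example from Dive into Deep Learning (d2l), so one way to exercise it, purely as an assumed setup rather than something defined in this article, is to generate polynomial features for a cubic true function and fit models of different orders:

import math
from mxnet import np, npx
npx.set_np()

max_degree = 20                                   # maximum order of the polynomial features
n_train, n_test = 100, 100
true_w = np.zeros(max_degree)
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])       # only the first four coefficients are non-zero

features = np.random.normal(size=(n_train + n_test, 1))
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i + 1)      # x^i / i! keeps the feature magnitudes comparable
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)

# Underfitting: only the constant and linear terms (too simple, high bias)
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])

# Good fit: the degree matches the true function
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])

# Overfitting: all 20 polynomial features (too complex, high variance)
train(poly_features[:n_train, :], poly_features[n_train:, :],
      labels[:n_train], labels[n_train:])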

Model Building

Choosing a model can seem intimidating, but a good rule is to start simple and then build your way up. The simplest model is a linear regression, where the outputs are a linearly weighted combination of the inputs. In our model, we will use an extension of linear regression called polynomial regression to learn the relationship between x and y. Polynomial regression, where the inputs are raised to different powers, is still considered a form of “linear” regression even though the graph does not form a straight line (this confused me at first as well!). The general equation for a polynomial is below:

y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε

Here, y represents the label and x is the feature. The beta terms are the model parameters, which will be learned during training, and epsilon is the error present in any model. Once the model has learned the beta values, we can plug in any value for x and get a corresponding prediction for y. A polynomial is defined by its order (or degree), which is the highest power of x in the equation. A straight line is a polynomial of degree 1, while a parabola has degree 2.
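The fact that this is still linear regression, linear in the betas even though the fitted curve is not a straight line, can be seen in a small NumPy sketch; the data and true coefficients below are assumptions made for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=50)   # assumed true betas: 0.5, 2.0, -1.5

# Design matrix with columns [1, x, x^2]: the fit is an ordinary linear
# least-squares problem in the beta coefficients
X = np.column_stack([np.ones_like(x), x, x**2])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betas)   # estimates of beta_0, beta_1, beta_2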

How to differentiate between Overfitting and Underfitting?

Solving the issues of bias and variance ultimately amounts to solving underfitting and overfitting. High bias corresponds to a model that is too simple, while high variance corresponds to a model that is too complex. As more and more parameters are added to a model, its complexity rises: variance becomes our primary concern while bias steadily falls.

Before going any further, let us define what bias and variance are.

  1. Bias: Bias is a measure of how close our predictive model’s average prediction is to the training data. Generally, an algorithm with high bias lets our model learn quickly and is easy to understand, but it is less flexible. Being less flexible, the algorithm loses its ability to capture complex patterns, which results in underfitting of the model. In such a case, getting more training data won’t help your cause.
  2. Variance: Variance defines the deviation in predictions when switching from one dataset to another. In other words, it defines how much the output of a model changes from one training dataset to another. Ideally, the learned model shouldn’t change much no matter which sample of the data we train on, but that doesn’t happen in practice, and when there is a big change in results from one dataset to another, it is due to the high variance of the particular ML model. When a learning algorithm suffers from high variance, getting more training data does help to deal with the problem, as the quick sketch below illustrates.
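A minimal sketch of this dataset-to-dataset variation, using NumPy polynomial fits; the sine-wave data, sample sizes, and degrees are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)

def sample_dataset(n=40):
    # Each call simulates collecting a fresh dataset from the same underlying process
    x = rng.uniform(0, np.pi, size=n)
    y = np.sin(x) + rng.normal(scale=0.2, size=n)
    return x, y

x0 = 1.0                        # fixed query point at which predictions are compared
predictions = {1: [], 10: []}   # polynomial degree -> predictions across datasets
for _ in range(200):
    x, y = sample_dataset()
    for degree in predictions:
        coefs = np.polyfit(x, y, degree)
        predictions[degree].append(np.polyval(coefs, x0))

for degree, preds in predictions.items():
    preds = np.array(preds)
    bias = preds.mean() - np.sin(x0)   # distance of the average prediction from the truth
    variance = preds.var()             # spread of predictions from dataset to dataset
    print(f"degree {degree}: bias {bias:+.4f}, variance {variance:.4f}")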

How do you overcome Overfitting and Underfitting in your ML Model?

1. Underfitting:

To solve the problem of underfitting, we model the expected value of the target variable as a d-th degree polynomial, i.e. the general polynomial equation given earlier, and treat the degree d as something we can increase.

The training error will tend to decrease as we increase the degree d of the polynomial. The cross-validation error will also decrease up to a point, but beyond a certain degree it starts to increase again, tracing out a convex (U-shaped) curve when plotted against d.

The concept that we just described is called Polynomial Regression. Polynomial regression is built into scikit-learn through the PolynomialFeatures transform, which projects the inputs onto polynomial basis features.
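A minimal sketch of this in scikit-learn, using a validation curve over the polynomial degree; the sine-wave data and parameter values are illustrative assumptions, and because the default score is R² (higher is better), the cross-validation curve peaks and then falls rather than forming a U:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(0, np.pi, size=(80, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=80)

model = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = np.arange(1, 10)
train_scores, cv_scores = validation_curve(
    model, x, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees, cv=5)

for d, tr, cv in zip(degrees, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"degree {d}: train R^2 {tr:.3f}, cross-validation R^2 {cv:.3f}")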

2. Overfitting

To solve the problem of overfitting in our model, we need to control the flexibility of the model. Too much flexibility is exactly what lets the model chase noise, so we need to constrain it by an optimal amount. This can be done using regularization techniques. There are 3 commonly used regularization techniques, known as:

  • L1 regularization (Also known as Lasso regularization/penalization)
  • L2 regularization (Also known as Ridge regularization/penalization)
  • Elastic net

1. L1 Regularization

L1 or Lasso Regularization stands for the least absolute shrinkage and selection operator. Mathematically, it adds an L1 penalty term ‘P’ to the usual least-squares cost:

Cost = Σ (yᵢ − ŷᵢ)² + α · Σ |βⱼ|

We need to reduce the overfitting of our model, and to do so the penalty term ‘P’ (here the sum of the absolute values of the coefficients) is added to our existing cost function, with alpha controlling the strength of the regularization. The Lasso method overcomes overfitting not just by shrinking high values of the coefficients β, but by setting some of them exactly to 0 so that they become irrelevant. Therefore, you might end up with fewer features than the model you started with, which is a huge advantage.
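A minimal scikit-learn sketch of this coefficient-zeroing behaviour; the synthetic data and the choice of alpha are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
# Only the first three features actually matter; the other seven are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

plain = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ordinary least squares:", np.round(plain.coef_, 3))
print("lasso (alpha=0.1):     ", np.round(lasso.coef_, 3))   # most irrelevant coefficients become exactly 0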

2. L2 Regularization

Also known as Ridge regression or Tikhonov Regularization, it is defined by the following mathematical representation, where the penalty ‘P’ is the sum of the squared coefficients:

Cost = Σ (yᵢ − ŷᵢ)² + α · Σ βⱼ²

The ‘P’ term is the regularization penalty added to the cost function. The importance of this regularization is that it forces the β coefficients to be small, but it does not force them to become exactly zero. This means it will not get rid of features that are not useful, but rather minimize their impact on the trained model.
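A short scikit-learn sketch of Ridge on the same kind of synthetic data as the Lasso example above (again an illustrative assumption), showing coefficients that shrink but do not become exactly zero:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
print(np.round(ridge.coef_, 3))   # shrunk toward zero, but none forced exactly to zero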

3. Elastic Net

Elastic Net regularization combines both the Lasso (L1) and Ridge (L2) penalties, so it can draw on the strengths of both. The elastic net solution path is piecewise linear, and the penalized cost can be written as:

Cost = Σ (yᵢ − ŷᵢ)² + α · ( ρ · Σ |βⱼ| + (1 − ρ) · Σ βⱼ² )

where ρ controls the mix between the L1 and L2 penalties.
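In scikit-learn this mixing parameter is called l1_ratio; a minimal sketch on the same assumed synthetic data as above:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# l1_ratio=1.0 behaves like Lasso, l1_ratio=0.0 like Ridge; 0.5 blends the two penalties
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))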

Good fit in the statistical model:

Ideally, a model that makes its estimates with zero error would have a perfect fit to the data. In practice, a good fit is achieved in the space between overfitting and underfitting. To understand it, we need to look at the performance of our model over time as it learns from the training dataset.

As time goes on, our model continues to learn, and its error on the training and test data continues to decrease. If it is trained for too long, there is a greater chance of overfitting, because the model starts absorbing noise and less useful details, and its performance on unseen data decreases. To get a good fit, we stop at a point just before the test error starts to grow. At this point the model is said to have good skill on the training dataset as well as on our unseen test dataset.
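This “stop just before the test error starts to grow” idea is usually implemented as early stopping. A minimal sketch of the logic is below; train_one_epoch and validation_loss are hypothetical placeholders standing in for whatever training and evaluation routines the model uses, not functions from this article or any particular library:

def fit_with_early_stopping(model, patience=5, max_epochs=1000):
    # Hypothetical sketch: stop once the held-out error has not improved for `patience` epochs
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # hypothetical: one pass over the training data
        val_loss = validation_loss(model)   # hypothetical: error on held-out data
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                           # the error has started to grow; stop here
    return model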

Applying a machine learning model directly to a dataset does not always give the accuracy we expect; the model may be overfitting or underfitting the training data. This blog has given you a short introduction to identifying and fixing overfitting and underfitting using regularization techniques, the intuition behind how they work, and how the accuracy score can differ before and after applying them (for example, on the iris dataset).

If you wish to upskill and learn more about this field, check out Great Learning’s PGP – Artificial Intelligence and Machine learning course!

Tanuja Bahirat
Tanuja is a content writer who enjoys spending time in nature, watching football, and journaling. She loves attending music festivals and reading. In her current journey, she writes about recent advancements in technology and its impact on the world.
