Batch normalization
  1. What is Normalisation?
  2. How does Normalisation help?
  3. What is Batch Normalisation?
  4. Why do we use batch normalisation?
  5. Regularisation with batch normalisation
  6. Implementation using Keras

What is Normalisation?

Normalisation is a technique to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. This technique is generally applied as part of data preparation for machine learning and is needed when the input features span different ranges of values.

To put all the data points on the same scale, we have two techniques, viz. normalisation and standardisation.

A normalisation process consists of scaling down numerical data to a scale from zero to one, the highest value being one and the lowest being zero.

On the other hand, standardisation consists of subtracting the mean of the dataset from each data point and then dividing that difference by the data set’s standard deviation. Thus, we get a distribution with a mean of zero and a standard deviation of one. In practice, this standardisation process is often just referred to as normalisation as well.

                            Z = (x - mean) / standard deviation
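The two techniques described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a library implementation: the function names and the sample car lengths are made up for the example, and the code assumes the column is not constant (otherwise the division would be by zero).

```python
import numpy as np

def min_max_normalise(x):
    """Rescale values linearly so the lowest maps to 0 and the highest to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardise(x):
    """Subtract the mean, then divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical feature column: car lengths in feet.
lengths_ft = np.array([12.0, 14.0, 15.0, 16.0, 18.0])

scaled = min_max_normalise(lengths_ft)   # values now lie between 0 and 1
z = standardise(lengths_ft)              # mean close to 0, standard deviation close to 1

print(scaled)
print(z.mean(), z.std())
```

In practice the minimum, maximum, mean and standard deviation would usually be computed on the training set only and then reused for any new data.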

How does Normalisation help?

The training data set might have some features with very large numerical values and others with very small values. For example, suppose one feature is the length of a car in feet and another is the distance covered by the car. The latter clearly has a much wider range of values than the former.

Without normalising the input data, attributes with a larger range can influence the result more simply because their values are larger. But this doesn’t necessarily mean they are more important as predictors.

Also, data points with large values in a non-normalised data set can cause instability in neural networks. This happens because the relatively large inputs cascade down through the layers of the network. This may cause imbalanced gradients, which may in turn cause the well-known exploding gradient problem.

The exploding gradient is a problem when large error gradients accumulate at the earlier layers and result in very large updates to neural network model weights during training. Gradients are used during training to update the network weights, but this process works best when these updates are small and controlled.

Additionally, non-normalised data can significantly decrease training speed. Normalising the inputs places all of the data on the same scale, which increases training speed and helps avoid problems like exploding gradients, because there won’t be a wide range between data points.
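To make the effect of scale on gradients concrete, here is a small sketch using a linear model with a mean squared error loss. The data are randomly generated stand-ins for the car example (the ranges and the unit of distance are assumptions), and a linear model is used only because its gradient is easy to write down; it is not meant to reproduce the behaviour of a deep network exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: car length (tens of feet) and distance covered (up to 100,000).
length = rng.uniform(10, 20, size=100)
distance = rng.uniform(1_000, 100_000, size=100)
X_raw = np.column_stack([length, distance])
y = rng.normal(size=100)  # arbitrary targets, just to have a loss

def mse_gradient(X, y, w):
    """Gradient of the mean squared error of the linear model y_hat = X @ w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

# Standardise each feature column to mean 0 and standard deviation 1.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

w0 = np.ones(2)  # same starting weights for both cases
grad_raw = mse_gradient(X_raw, y, w0)
grad_std = mse_gradient(X_std, y, w0)

print("gradient on raw inputs:         ", grad_raw)
print("gradient on standardised inputs:", grad_std)
```

On the raw inputs the gradient is several orders of magnitude larger and is dominated by the distance feature, so a single learning rate cannot suit both weights. After standardisation the two components are of comparable size.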

What is Batch Normalisation?

In deep learning, rather than performing normalisation only once at the beginning, we can apply it throughout the network. This is called batch normalisation. The output from the activation function of a layer is normalised and passed as input to the next layer.

It is called “batch” normalisation because we normalise the selected layer’s values by using the mean and standard deviation (or variance) of the values in the current batch.

As normalisation squeezes the values into a certain range, which is not always desirable, we apply two parameters (g, b) to the normalised values. These parameters are not hyperparameters: they are learned through backpropagation during training, in the same way as the network’s weights.

1. Z = (x - mean) / st. deviation: Normalise using the mean and standard deviation of the current batch.
2. Z * g: Multiply the normalised distribution by g, which is a scaling factor. This factor scales the distribution obtained after step 1.
3. (Z * g) + b: Add ‘b’, which is a shifting factor, to the resulting distribution. This factor shifts the distribution obtained after step 2.

The parameters ‘g’ and ‘b’ are trainable, which means that they are learned and optimised through backpropagation during training, in the same way as the network’s weights. They start from initial values (commonly g = 1 and b = 0) and change over the course of training.
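The normalise, scale and shift steps can be sketched as a single NumPy function. This is a minimal sketch of the training-time forward pass only. The function name, the `eps` constant added for numerical stability, and the example values of g and b are assumptions made for illustration; they are not taken from the text above.

```python
import numpy as np

def batch_norm_forward(x, g, b, eps=1e-5):
    """Normalise each feature over the batch, then scale by g and shift by b.

    x: array of shape (batch_size, features)
    g, b: arrays of shape (features,)
    eps: small constant that avoids division by zero (an assumption)
    """
    mean = x.mean(axis=0)                   # per-feature mean of the current batch
    var = x.var(axis=0)                     # per-feature variance of the current batch
    z = (x - mean) / np.sqrt(var + eps)     # step 1: normalise
    return z * g + b                        # steps 2 and 3: scale, then shift

rng = np.random.default_rng(0)
batch = rng.normal(loc=50.0, scale=10.0, size=(32, 4))  # hypothetical layer values

g = np.full(4, 2.0)   # example scaling factors
b = np.full(4, 0.5)   # example shifting factors
out = batch_norm_forward(batch, g, b)

print(out.mean(axis=0))  # close to b
print(out.std(axis=0))   # close to g
```

Whatever the mean and spread of the incoming batch, the output of each feature ends up with a mean close to b and a standard deviation close to g, which is why these two parameters let the network undo or adjust the normalisation where that helps.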

In some cases, batch normalisation may be applied to the inputs of the layer before the activation function. This may be more appropriate for activation functions that can produce non-Gaussian distributions, such as the rectified linear unit (ReLU), the modern default for most network types.

It may be more appropriate to use batch normalisation after the activation function for s-shaped functions like the hyperbolic tangent and logistic function.
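The two placements differ only in the order of operations, as the following NumPy sketch shows. The layer weights and inputs are random placeholders, and the scale and shift parameters are left out (g = 1, b = 0) to keep the comparison short.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalise each feature over the batch (g = 1 and b = 0 for simplicity)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(64, 8))     # hypothetical batch of 64 examples
weights = rng.normal(size=(8, 4))     # hypothetical layer weights

pre_activation = inputs @ weights

# Option A: normalise before the activation (as described for ReLU).
out_bn_then_relu = relu(batch_norm(pre_activation))

# Option B: normalise after the activation (as described for tanh / logistic).
out_tanh_then_bn = batch_norm(np.tanh(pre_activation))

print(out_bn_then_relu.mean(axis=0))   # not zero: ReLU removed the negative half
print(out_tanh_then_bn.mean(axis=0))   # close to zero: normalised last
```

In Keras, Option A would roughly correspond to a `Dense` layer without an activation, followed by `BatchNormalization` and then an `Activation` layer, while Option B places `BatchNormalization` after a layer that already has its activation.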

Why do we use batch normalisation?

Batch normalisation significantly decreases the training time of neural networks by reducing the internal covariate shift. To understand the internal covariate shift, let us first see what covariate shift is.

Consider a deep neural network that can detect cats. We train the network on only the images of black cats. As a result, this model won’t perform well on different coloured images of cats and probably won’t recognise a cat of colour other than black. The reason is the shift in the input distribution of the data. Although the training set and the prediction set are both images of cats they may differ in the data distribution. This is known as the covariate shift.

The internal covariate shift refers to the change in the distribution of the inputs to the different layers. During training, each layer tries to correct itself for the error it made during forward propagation. But every layer acts separately, each one trying to correct its own error.

For example, suppose that initially the 2nd layer of a network maps an input X to the output Y, and the 3rd layer maps the output of the 2nd layer, i.e. Y, to the output Z. Suppose the 2nd layer then adjusts its weights and biases to correct its error and changes its output to A (with a totally different distribution from Y). Now the 3rd layer has to map A to the output Z, so all the updates it made to map Y to Z are of no use and it has to start all over again.

More specifically, due to changes in the weights of the previous layers, the distribution of input values for the current layer changes, forcing it to learn from a new “input distribution”.

With batch normalisation, the distribution of a layer’s outputs does not change much when the layer adjusts its weights and biases. Therefore, in the example above, the distributions of Y and A do not differ much, so the training of the following (3rd) layer is hardly disturbed. It turns out that training a network is most efficient and fastest when the distribution of inputs to each layer stays similar.
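The Y and A example can be imitated numerically. In this sketch a single linear layer stands in for the 2nd layer, and the "weight update" is an arbitrary random perturbation plus a bias of 5, chosen only to make the shift visible; a real training step would change the weights far less abruptly. The scale and shift parameters are again left out.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalise each feature over the batch (g = 1 and b = 0 for simplicity)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))                               # inputs to the 2nd layer

W_before = rng.normal(size=(5, 3))                          # hypothetical weights
W_after = W_before + rng.normal(scale=2.0, size=(5, 3))     # weights after an update

Y = X @ W_before          # 2nd-layer output before the update
A = X @ W_after + 5.0     # 2nd-layer output after the update (new bias of 5)

print("raw means:       ", Y.mean(axis=0), A.mean(axis=0))
print("raw std devs:    ", Y.std(axis=0), A.std(axis=0))
print("normalised means:", batch_norm(Y).mean(axis=0), batch_norm(A).mean(axis=0))
print("normalised stds: ", batch_norm(Y).std(axis=0), batch_norm(A).std(axis=0))
```

The raw outputs Y and A have clearly different means and spreads, while their normalised versions both have a mean near 0 and a standard deviation near 1 for every feature, so the 3rd layer sees inputs on a stable scale.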

“We apply Batch normalisation to the best-performing ImageNet classification network and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin.”

Original BatchNorm Paper

Batch normalisation also makes the network more stable during training. This may allow the use of much larger than normal learning rates, which in turn may further speed up the learning process.

Regularisation with batch normalisation

In addition to speeding up the learning of neural networks, batch normalisation also provides a weak form of regularisation. As the mean and standard deviation are computed on each mini-batch rather than on the whole dataset, they are noisy estimates. This adds noise to the values passed through the network, which has a regularising effect.

However, since batch normalisation provides only weak regularisation, it must not be fully relied upon to avoid over-fitting, although other regularisation could be reduced accordingly. Also, the regularising effect is noticeable mainly when the batch size is small, because the batch statistics are noisier.
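The link between batch size and noise can be checked with a quick experiment. The sketch below draws hypothetical layer outputs from a standard normal distribution, splits them into mini-batches of different sizes, and measures how much the per-batch mean varies from batch to batch. The helper name and the batch sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(loc=0.0, scale=1.0, size=10_000)  # hypothetical layer outputs

def batch_mean_spread(values, batch_size):
    """Standard deviation of the per-batch means for a given batch size."""
    n_batches = len(values) // batch_size
    batches = values[: n_batches * batch_size].reshape(n_batches, batch_size)
    return batches.mean(axis=1).std()

spread_small = batch_mean_spread(activations, 8)
spread_medium = batch_mean_spread(activations, 64)
spread_large = batch_mean_spread(activations, 512)

for size, spread in ((8, spread_small), (64, spread_medium), (512, spread_large)):
    print(f"batch size {size:>3}: spread of batch means = {spread:.3f}")
```

The spread shrinks as the batch grows, roughly in proportion to one over the square root of the batch size, so with large batches the statistics are nearly constant and little noise is injected.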

Implementation Using Keras

In this section, we demonstrate the increase in training speed due to the use of batch normalisation. We train two networks on the same data, one with batch normalisation and the other without it, and compare their learning curves.

from keras.datasets import mnist
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Conv2D, MaxPool2D, Flatten
from keras.layers import BatchNormalization
from keras.optimizers import SGD
from matplotlib import pyplot
import numpy as np

(trainX, trainy), (testX, testy) = mnist.load_data()

# add a channel axis and convert values from 0 to 255 into range 0 to 1.
trainX = np.expand_dims(trainX, axis=-1)
trainX = trainX.astype("float32") / 255.0
# one-hot encode the labels, as categorical_crossentropy expects
trainy = to_categorical(trainy, 10)

# define model with batch normalisation
# NOTE: the placement of the BatchNormalization, MaxPool2D and Flatten layers
# is an assumption; the listing imports them but does not show where they go.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(BatchNormalization())
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D((2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(10, activation='softmax'))
opt = SGD(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, epochs=20, verbose=1)

# define model without batch normalisation
model_1 = Sequential()
model_1.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model_1.add(Conv2D(64, (3, 3), activation='relu'))
model_1.add(MaxPool2D((2, 2)))
model_1.add(Flatten())
model_1.add(Dense(64, activation='relu'))
model_1.add(Dense(10, activation='softmax'))
# note: this model uses a smaller learning rate (0.001) than the one above (0.01)
opt1 = SGD(learning_rate=0.001)
model_1.compile(loss='categorical_crossentropy', optimizer=opt1, metrics=['accuracy'])
# fit model
history1 = model_1.fit(trainX, trainy, epochs=20, verbose=1)

# plot history
pyplot.plot(history.history['loss'], label='with batch normalisation')
pyplot.plot(history1.history['loss'], label='without batch normalisation')
pyplot.legend()
pyplot.show()

[Figure: training loss per epoch for the models with and without batch normalisation]

As we can see, the model with batch normalisation converged much faster than the model without it. You may also notice that the final loss of the model with batch normalisation is considerably lower than that of the other model.

This brings us to the end of this article where we have learned about Batch normalisation and the benefits of using it.

If you wish to learn more about Python and the concepts of Machine Learning, upskill with Great Learning’s PG Program Artificial Intelligence and Machine Learning.


