The Vanishing Gradient Problem

Vanishing Gradient Problem
Methods proposed to overcome vanishing gradient problem
Residual neural networks
ResNet
Rectified Linear Unit Activation Function
Conclusion

Contributed by: Dinesh

Introduction to Vanishing Gradient Problem

In Machine Learning, the Vanishing Gradient Problem is encountered while training Neural Networks with gradient-based methods (example, Back Propagation). This problem makes it hard to learn and tune the parameters of the earlier layers in the network.

The vanishing gradients problem is one example of unstable behaviour that you may encounter when training a deep neural network.

It describes the situation where a deep multilayer feed-forward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end of the model.

The result is the general inability of models with many layers to learn on a given dataset, or for models with many layers to prematurely converge to a poor solution.

Also Read: Top Deep Learning Interview Questions for 2020

Methods proposed to overcome vanishing gradient problem

Multi-level hierarchy
Long short – term memory
Faster hardware
Residual neural networks (ResNets)
ReLU

Residual neural networks (ResNets)

One of the newest and most effective ways to resolve the vanishing gradient problem is with residual neural networks, or ResNets (not to be confused with recurrent neural networks). It was noted before ResNets that a deeper network would have higher training error than the shallow network.

The weights of a neural network are updated using the backpropagation algorithm. The backpropagation algorithm makes a small change to each weight in such a way that the loss of the model decreases. How does this happen? It updates each weight such that it takes a step in the direction along which the loss decreases. This direction is nothing but the gradient of this weight (concerning the loss).

Using the chain rule, we can find this gradient for each weight. It is equal to (local gradient) x (gradient flowing from ahead),

Here comes the problem. As this gradient keeps flowing backwards to the initial layers, this value keeps getting multiplied by each local gradient. Hence, the gradient becomes smaller and smaller, making the updates to the initial layers very small, increasing the training time considerably. We can solve our problem if the local gradient somehow became 1.

ResNet

How can the local gradient be 1, i.e, the derivative of which function would always be 1? The Identity function!

As this gradient is back propagated, it does not decrease in value because the local gradient is 1.

The ResNet architecture, shown below, should now make perfect sense as to how it would not allow the vanishing gradient problem to occur. ResNet stands for Residual Network.

These skip connections act as gradient superhighways, allowing the gradient to flow unhindered. And now you can understand why ResNet comes in flavours like ResNet50, ResNet101 and ResNet152.

Also Read: What is Rectified Linear Unit (ReLU)?

Rectified Linear Unit Activation Function (ReLU’s)

Demonstrating the vanishing gradient problem in TensorFlow

Creating the model

In the TensorFlow code that I am about to show you, we’ll be creating a seven-layer densely connected network (including the input and output layers) and using the TensorFlow summary operations and TensorBoard visualization to see what is going on with the gradients. The code uses the TensorFlow layers (tf.layers) framework, which allows quick and easy building of networks. The data we will be training the network on is the MNIST hand-written digit recognition dataset that comes packaged up with the TensorFlow installation. To create the dataset, we can run the following:

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

The MNIST data can be extracted from this mnist data set by calling mnist.train.next_batch(batch_size). In this case, we’ll just be looking at the training data, but you can also extract a test dataset from the same data. In this example, I’ll be using the feed_dict methodology and placeholder variables to feed in the training data, which isn’t the optimal method but it will do for these purposes.

Setup the data placeholders:

self.input_images = tf.placeholder(tf.float32, shape=[None, self._input_size])
self.labels = tf.placeholder(tf.float32, shape=[None, self._label_size])

The MNIST data input size (self._input_size) is equal to the 28 x 28 image pixels i.e. 784 pixels. The number of associated labels, self._label_size is equal to the 10 possible hand-written digit classes in the MNIST dataset.

We’ll be creating a slightly deep fully connected network – a network with seven total layers including input and output layers. To create these densely connected layers easily, we’ll be using TensorFlow’s handy tf.layers API and a simple Python loop like follows:

# create self._num_layers dense layers as the model
input = self.input_images
for i in range(self._num_layers - 1):
    input = tf.layers.dense(input, self._hidden_size, activation=self._activation,
                                    name='layer{}'.format(i+1))

First, the generic input variable is initialized to be equal to the input images (fed via the placeholder)

Next, the code runs through a loop where multiple dense layers are created, each named ‘layerX’ where X is the layer number.

The number of nodes in the layer is set equal to the class property self._hidden_size and the activation function is also supplied via the property self._activation.

Next we create the final output layer (you’ll note that the loop above terminates before it gets to creating the final layer), and we don’t supply an activation to this layer. In the tf.layers API, a linear activation (i.e. f(x) = x) is applied by default if no activation argument is supplied.

# don't supply an activation for the final layer - the loss definition will
# supply softmax activation. This defaults to a linear activation i.e. f(x) = x
logits = tf.layers.dense(input, 10, name='layer{}'.format(self._num_layers))

Next, the loss operation is setup and logged:

# use softmax cross entropy with logits - no need to apply softmax activation to
# logits
self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits,
                                                                             labels=self.labels))
# add the loss to the summary
tf.summary.scalar('loss', self.loss)

The loss used in this instance is the handy TensorFlow softmax_cross_entropy_with_logits_v2 (the original version is soon to be deprecated). This loss function will apply the softmax operation to the un-activated output of the network, then apply the cross entropy loss to this outcome. After this loss operation is created, it’s output value is added to the tf.summary framework. This framework allows scalar values to be logged and subsequently visualized in the TensorBoard web-based visualization page. It can also log histogram information, along with audio and images – all of these can be observed through the aforementioned TensorBoard visualization.

Next, the program calls a method to log the gradients, which we will visualize to examine the vanishing gradient problem:

self._log_gradients(self._num_layers)

This method looks like the following:

def _log_gradients(self, num_layers):
    gr = tf.get_default_graph()
    for i in range(num_layers):
        weight = gr.get_tensor_by_name('layer{}/kernel:0'.format(i + 1))
        grad = tf.gradients(self.loss, weight)[0]
        mean = tf.reduce_mean(tf.abs(grad))
        tf.summary.scalar('mean_{}'.format(i + 1), mean)
        tf.summary.histogram('histogram_{}'.format(i + 1), grad)
        tf.summary.histogram('hist_weights_{}'.format(i + 1), grad)

In this method, first the TensorFlow computational graph is extracted so that weight variables can be called out of it. Then a loop is entered into, to cycle through all the layers. For each layer, first the weight tensor for the given layer is grabbed by the handy function get_tensor_by_name. You will recall that each layer was named “layerX” where X is the layer number. This is supplied to the function, along with “/kernel:0” – this tells the function that we are trying to access the weight variable (also called a kernel) as opposed to the bias value, which would be “/bias:0”.

On the next line, the tf.gradients() function is used. This will calculate gradients of the form ∂y/∂x where the first argument supplied to the function is y and the second is x. In the gradient descent step, the weight update is made in proportion to ∂loss/∂W, so in this case the first argument supplied to tf.gradients() is the loss, and the second is the weight tensor.

Next, the mean absolute value of the gradient is calculated, and then this is logged as a scalar in the summary. Next, histograms of the gradients and the weight values are also logged in the summary. The flow now returns back to the main method in the class.

self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)
self.accuracy = self._compute_accuracy(logits, self.labels)
tf.summary.scalar('acc', self.accuracy)

The code above is fairly standard TensorFlow usage – defining an optimizer, in this case the flexible and powerful AdamOptimizer(), and also a generic accuracy operation, the outcome of which is also added to the summary (see the Github code for the accuracy method called).

Finally a summary merge operation is created, which will gather up all the summary data ready for export to the TensorBoard file whenever it is executed:

self.merged = tf.summary.merge_all()

An initialization operation is also created. Now all that is left is to run the main training loop.

Training the model

The main training loop of this experimental model is shown in the code below:

def run_training(model, mnist, sub_folder, iterations=2500, batch_size=30):
    with tf.Session() as sess:
        sess.run(model.init_op)
        train_writer = tf.summary.FileWriter(base_path + sub_folder,
                                             sess.graph)
        for i in range(iterations):
            image_batch, label_batch = mnist.train.next_batch(batch_size)
            l, _, acc = sess.run([model.loss, model.optimizer, model.accuracy],
                                 feed_dict={model.input_images: image_batch, model.labels: label_batch})
            if i % 200 == 0:
                summary = sess.run(model.merged, feed_dict={model.input_images: image_batch,
                                                            model.labels: label_batch})
                train_writer.add_summary(summary, i)
                print("Iteration {} of {}, loss: {:.3f}, train accuracy: "
                      "{:.2f}%".format(i, iterations, l, acc * 100))

This is a pretty standard TensorFlow training loop (if you’re unfamiliar with this, see my TensorFlow T utorial) – however, one non-standard addition is the tf.summary.FileWriter() operation and its associated uses. This operation generally takes two arguments – the location to store the files and the session graph. Note that it is a good idea to set up a different sub folder for each of your TensorFlow runs when using summaries, as this allows for better visualization and comparison of the various runs within TensorBoard.

Every 200 iterations, we run the merged operation, which is defined in the class instance model – as mentioned previously, this gathers up all the logged summary data ready for writing. The train_writer.add_summary() operation is then run on this output, which writes the data into the chosen location (optionally along with the iteration/epoch number).

The summary data can then be visualized using TensorBoard. To run TensorBoard, using command prompt, navigate to the base directory where all the sub folders are stored, and run the following command:

tensorboard –log_dir=whatever_your_folder_path_is

Upon running this command, you will see startup information in the prompt which will tell you the address to type into your browser which will bring up the TensorBoard interface. Note that the TensorBoard page will update itself dynamically during training, so you can visually monitor the progress.

Now, to run this whole experiment, we can run the following code which cycles through each of the activation functions:

scenarios = ["sigmoid", "relu", "leaky_relu"]
act_funcs = [tf.sigmoid, tf.nn.relu, tf.nn.leaky_relu]
assert len(scenarios) == len(act_funcs)
# collect the training data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
for i in range(len(scenarios)):
    tf.reset_default_graph()
    print("Running scenario: {}".format(scenarios[i]))
    model = Model(784, 10, act_funcs[i], 6, 10)
    run_training(model, mnist, scenarios[i])

This should be pretty self-explanatory. Three scenarios are investigated – a scenario for each type of activation reviewed: sigmoid, ReLU and Leaky ReLU. Note that, in this experiment, I’ve set up a densely connected model with 6 layers (including the output layer but excluding the input layer), with each having a layer size of 10 nodes.

Analyzing the results

The first figure below shows the training accuracy of the network, for each of the activations:

Accuracy of the three activation scenarios – sigmoid (blue), ReLU (red), Leaky ReLU (green)

As can be observed, the sigmoid (blue) significantly under performs the ReLU and Leaky ReLU activation functions. Is this due to the vanishing gradient problem? The plots below show the mean absolute gradient logs during training, again for the three scenarios:

Three scenario mean absolute gradients – output layer (6th layer) – sigmoid (blue), ReLU (red), Leaky ReLU (green)

Three scenarios mean absolute gradients – 1st layer – sigmoid (blue), ReLU (red), Leaky ReLU (green)

The first graph shows the mean absolute gradients of the loss concerning the weights for the output layer, and the second graph shows the same gradients for the first layer, for all three activation scenarios. First, the overall magnitudes of the gradients for the ReLU activated networks are significantly greater than those in the sigmoid activated network. It can also be observed that there is a significant reduction in the gradient magnitudes between the output layer (layer 6) and the first layer (layer 1). This is the vanishing gradient problem.

You may be wondering why the ReLU activated networks still experience a significant reduction in the gradient values from the output layer to the first layer – weren’t these activation functions, with their gradients of 1 for activated regions, supposed to stop vanishing gradients? Yes and no. The gradient of the ReLU functions where x > 0 is 1, so there is no degradation in multiplying 1’s together. However, the “chaining” expression I showed previously describing the vanishing gradient problem, i.e.:

∂C∂Wl∝f′(z(l))f′(z(l+1))f′(z(l+2))…

isn’t quite the full picture. Rather, the back-propagation product is also in some sense proportional to the values of the weights in each layer, so more completely, it looks something like this:

∂C∂Wl∝f′(z(l))⋅Wl⋅f′(z(l+1))⋅Wl+1⋅f′(z(l+2))⋅Wl+2…

If the weight values are consistently < 0, then we will also see a vanishing of gradients, as the chained expression will reduce through the layers as the weight values < 0 are multiplied together. We can confirm that the weight values in this case are < 0 by checking the histogram that was logged for the weight values in each layer:

The diagram above shows the histogram of layer 4 weights in the leaky ReLU scenario as they evolve through the epochs (y-axis) – this is a handy visualization available in the TensorBoard panel. Note that the weights are consistently < 0, and therefore we should expect the gradients to reduce even under the ReLU scenarios. In saying all this, we can observe that the degradation of the gradients is significantly worse in the sigmoid scenario than the ReLU scenarios. The mean absolute weight reduces by a factor of 30 between layer 6 and layer 1 for the sigmoid scenario, compared to a factor of 6 for the leaky ReLU scenario (the standard ReLU scenario is pretty much the same). Therefore, while there is still a vanishing gradient problem in the network presented, it is greatly reduced by using the ReLU activation functions. This benefit can be observed in the significantly better performance of the ReLU activation scenarios compared to the sigmoid scenario. Note that, at least in this example, there is not an observable benefit of the leaky ReLU activation function over the standard ReLU activation function.

Conclusion

This post has shown you how the vanishing gradient problem comes about when using the old canonical sigmoid activation function. However, the problem can be reduced using the ReLU family of activation functions. You will also have seen how to log summary information in TensorFlow and plot it in TensorBoard to understand more about your networks. Hope the article helps.

If you found this helpful and wish to learn more such concepts, join Great Learning Academy’s free online courses today.