The Vanishing Gradient Problem


What is the Vanishing Gradient?

The vanishing gradient problem is a phenomenon in which the gradients used to update the weights of a neural network become extremely small, shrinking towards zero. It becomes a serious issue in deep networks with many hidden layers.

In backpropagation, the gradients of the loss function are passed backward from the output layer all the way to the input layer, and at each layer they are multiplied by the derivatives of that layer's activation function.

In deep networks, this can cause the gradients to vanish as they propagate through many layers, especially if the activation function has regions with very small derivatives. As the gradients approach zero, the weights in the earlier layers, those closest to the input, barely change, and the model cannot learn effectively from the data.

This leaves the model, and in particular its earlier layers, stuck, so that training is significantly slowed or halted altogether. The result is a model with poor performance, even when plenty of data and a large architecture are available.
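
To make the compounding effect concrete, consider the sigmoid function, whose derivative never exceeds 0.25. The short NumPy sketch below is a standalone illustration (not part of the TensorFlow example later in this article): it multiplies one sigmoid-derivative factor per layer, the way backpropagation does, and shows how quickly the product collapses.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Illustrative pre-activation values for a 10-layer network
pre_activations = np.linspace(-2.0, 2.0, 10)

# Backpropagation multiplies one derivative factor per layer
gradient_factor = np.prod(sigmoid_derivative(pre_activations))
print(gradient_factor)  # roughly 2e-8 - the gradient signal has all but vanished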

The Vanishing Gradient Problem Explained

The vanishing gradient problem is one of the most significant issues in training deep neural networks. It limits the model's ability to learn and effectively caps the depth of networks that can be trained in a reasonable way. The issue is most prevalent in deep networks, i.e. networks with many hidden layers.

The primary culprits are saturating activation functions, specifically the sigmoid and tanh functions used in early neural networks. Both have regions where their derivatives become very small. During backpropagation, these derivatives are multiplied together repeatedly, so the gradient values shrink and converge toward zero.

As the number of layers in the network increases, the problem becomes more pronounced. In deep networks, the gradients are backpropagated through many layers, and by the time they reach the earliest layers they are often so small that they have little or no effect on the weight updates.

As a result, the model fails to learn the features represented by the earliest layers. This is harmful not only in image recognition but also in tasks such as natural language processing, where the early layers capture essential low-level structure in the data.

The Role of Gradients in Neural Networks


  • Neural Networks:
    • Consists of multiple layers of neurons that process and pass data.
    • The goal is to set the correct weights and biases for accurate predictions or classifications.
  • Training with Backpropagation:
    • Backpropagation is used to update weights and biases by propagating errors backwards through the network.
    • Gradients are crucial in this process, as they help determine how much to adjust the weights based on prediction errors.
  • Understanding Gradients:
    • A gradient indicates the direction and magnitude of weight adjustments required to minimize error.
    • It is calculated as the derivative of the loss function with respect to the weights.
  • Iterative Process:
    • The gradient-based updates are applied iteratively until the model’s weights are optimized for accurate predictions (a minimal update step is sketched after this list).
  • Vanishing Gradient Issue:
    • If gradients become too small, learning slows down, leading to the vanishing gradient problem, which makes optimization harder.
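
To ground the bullet points above, here is a minimal sketch of a single gradient-descent update for one weight; the numbers (learning_rate, weight, gradient) are purely illustrative and not taken from any particular model.

# One gradient-descent update for a single weight (illustrative values)
learning_rate = 0.1
weight = 0.5
gradient = 0.2          # dLoss/dWeight, obtained from backpropagation

weight -= learning_rate * gradient   # step against the gradient to reduce the loss
print(weight)                        # 0.48

# If the gradient has vanished (e.g. 1e-8), the update is negligible and learning stalls
vanished_gradient = 1e-8
weight -= learning_rate * vanished_gradient
print(weight)                        # effectively still 0.48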

The Impact of Vanishing Gradients on Deep Neural Networks

  • Slow or No Learning:
    • Vanishing gradients cause weights in earlier layers not to update correctly.
    • This leads to failure in convergence or extremely slow convergence, preventing the model from learning effectively.
  • Training Deep Networks:
    • Deep networks learn hierarchical representations, but vanishing gradients make it difficult for deeper layers to learn.
    • As layers are added, gradients get smaller, leading to poor optimization and limited practical use in real-world applications.
  • Poor Feature Representation:
    • When gradients shrink, earlier layers, which capture low-level features, cannot update their weights.
    • This results in the loss of both local and global information, leading to weak feature representation and overall poor performance.
  • Difficulty with Nonlinear Activation Functions:
    • Activation functions like sigmoid and tanh can worsen the vanishing gradient problem.
    • Large or small input values squeeze gradients to very small values, exacerbating the issue.

Causes of Vanishing Gradients

  • Choice of Activation Function:
    • Functions like sigmoid and tanh can cause vanishing gradients when their inputs are very large or very small.
    • The derivatives of these functions approach zero for large or small input values, making gradients very small and hindering learning, especially in deeper layers.
  • Deep Neural Network Architectures:
    • As more layers are added, gradients are multiplied by activation function derivatives at each layer.
    • This results in exponentially smaller gradients as depth increases, making it harder for the network to learn.
  • Weight Initialization:
    • Poor weight initialization can worsen vanishing gradients (a small demonstration follows this list).
    • Very small initial weights produce very small activations and gradients, while improper scaling can make training unstable and slow.
  • Poorly Chosen Learning Rates:
    • A high learning rate may cause the network to skip optimal weight values.
    • A low learning rate results in extremely small updates, preventing the network from converging properly.
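
The weight-initialization point is easy to demonstrate. The NumPy sketch below (a standalone illustration, not the article's TensorFlow model) pushes a batch of inputs through ten tanh layers, once with very small random weights and once with Xavier-style scaling, and prints the spread of the final activations; small activations feed directly into small gradients during backpropagation.

import numpy as np

np.random.seed(0)
n_layers, width, batch = 10, 100, 64
x_small = x_xavier = np.random.randn(batch, width)

for _ in range(n_layers):
    # very small initial weights: activations shrink layer by layer
    w_small = np.random.randn(width, width) * 0.01
    x_small = np.tanh(x_small @ w_small)

    # Xavier-style scaling keeps the activation variance roughly constant
    w_xavier = np.random.randn(width, width) * np.sqrt(1.0 / width)
    x_xavier = np.tanh(x_xavier @ w_xavier)

print(x_small.std())   # collapses towards zero
print(x_xavier.std())  # stays at a healthy scale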

The Rectified Linear Unit (ReLU) Activation Function
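
ReLU is defined as f(x) = max(0, x): its derivative is exactly 1 for any positive input and 0 otherwise, so active units never scale gradients down the way saturating functions do. The quick comparison below is a standalone sketch (not part of the TensorFlow experiment that follows):

import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_derivative(x):
    return (x > 0).astype(float)   # 1 for positive inputs, 0 otherwise

x = np.array([-5.0, -1.0, 0.5, 1.0, 5.0])
print(sigmoid_derivative(x))   # roughly [0.007, 0.197, 0.235, 0.197, 0.007] - never above 0.25
print(relu_derivative(x))      # [0. 0. 1. 1. 1.] - no shrinkage for active units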

Demonstrating the vanishing gradient problem in TensorFlow

Creating the model

In the TensorFlow code that I am about to show you, we’ll create a seven-layer densely connected network (including the input and output layers) and use TensorFlow summary operations and TensorBoard visualization to visualize the gradients.

The code utilizes the TensorFlow layers (tf.layers) framework, which enables the quick and easy construction of networks. The data we will be training the network on is the MNIST handwritten digit recognition dataset that comes packaged with the TensorFlow installation. To load the dataset, we can run the following:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Training batches can then be extracted from this dataset by calling mnist.train.next_batch(batch_size). In this case, we’ll just be looking at the training data, but you can also extract a test dataset from the same object.

In this example, I’ll be using the feed_dict methodology and placeholder variables to feed in the training data, which isn’t the optimal method, but it will do for these purposes. 

Set up the data placeholders:

self.input_images = tf.placeholder(tf.float32, shape=[None, self._input_size])
self.labels = tf.placeholder(tf.float32, shape=[None, self._label_size])

The MNIST data input size (self._input_size) is equal to the 28 x 28 image pixels, i.e. 784 pixels. The number of associated labels, self._label_size, is equal to the 10 possible hand-written digit classes in the MNIST dataset.

We’ll be creating a slightly deep, fully connected network – a network with seven total layers, including the input and output layers. To create these densely connected layers easily, we’ll be using TensorFlow’s handy tf.layers API and a simple Python loop, as follows:

# create self._num_layers dense layers as the model
input = self.input_images
for i in range(self._num_layers - 1):
    input = tf.layers.dense(input, self._hidden_size, activation=self._activation,
                                    name='layer{}'.format(i+1))

First, the generic input variable is initialized to be equal to the input images (fed in via the placeholder).

Next, the code runs through a loop where multiple dense layers are created, each named ‘layerX’ where X is the layer number.

The number of nodes in the layer is set equal to the class property self._hidden_size, and the activation function is also supplied via the property self._activation.

Next, we create the final output layer (you’ll note that the loop above terminates before it gets to creating the final layer), and we don’t supply an activation to this layer. In the tf.layers API, a linear activation (i.e. f(x) = x) is applied by default if no activation argument is supplied.

# don't supply an activation for the final layer - the loss definition will
# supply softmax activation. This defaults to a linear activation i.e. f(x) = x
logits = tf.layers.dense(input, 10, name='layer{}'.format(self._num_layers))

Next, the loss operation is set up and logged:

# use softmax cross entropy with logits - no need to apply softmax activation to
# logits
self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits,
                                                                             labels=self.labels))
# add the loss to the summary
tf.summary.scalar('loss', self.loss)

The loss used in this instance is the handy TensorFlow softmax_cross_entropy_with_logits_v2 (the original version is soon to be deprecated). This loss function will apply the softmax operation to the unactivated output of the network, then apply the cross entropy loss to this outcome.

After this loss operation is created, its output value is added to the tf.summary framework. This framework allows scalar values to be logged and subsequently visualized in the TensorBoard web-based visualization page. It can also log histogram information, along with audio and images – all of these can be observed through the aforementioned TensorBoard visualization.

Next, the program calls a method to log the gradients, which we will visualize to examine the vanishing gradient problem:

self._log_gradients(self._num_layers)

This method looks like the following:

def _log_gradients(self, num_layers):
    gr = tf.get_default_graph()
    for i in range(num_layers):
        weight = gr.get_tensor_by_name('layer{}/kernel:0'.format(i + 1))
        grad = tf.gradients(self.loss, weight)[0]
        mean = tf.reduce_mean(tf.abs(grad))
        tf.summary.scalar('mean_{}'.format(i + 1), mean)
        tf.summary.histogram('histogram_{}'.format(i + 1), grad)
        tf.summary.histogram('hist_weights_{}'.format(i + 1), weight)

In this method, the TensorFlow computational graph is first extracted so that weight variables can be called out of it. Then, a loop is entered to cycle through all the layers. For each layer, first, the weight tensor for the given layer is grabbed by the handy function get_tensor_by_name.

You will recall that each layer was named “layerX” where X is the layer number. This is supplied to the function, along with “/kernel:0” – this tells the function that we are trying to access the weight variable (also called a kernel) as opposed to the bias value, which would be “/bias:0”.

On the following line, the tf.gradients() function is used. This will calculate gradients of the form ∂y/∂x where the first argument supplied to the function is y and the second is x. In the gradient descent step, the weight update is made in proportion to ∂loss/∂W, so in this case, the first argument supplied to tf.gradients() is the loss, and the second is the weight tensor.

Next, the mean absolute value of the gradient is calculated and logged as a scalar summary. Histograms of the gradients and of the weight values themselves are also logged. The flow then returns to the primary method in the class.

self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)
self.accuracy = self._compute_accuracy(logits, self.labels)
tf.summary.scalar('acc', self.accuracy)

The code above is reasonably standard TensorFlow usage: defining an optimizer, in this case the flexible and powerful AdamOptimizer(), and a generic accuracy operation, whose outcome is also added to the summary (see the GitHub code for the accuracy method).
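
The accuracy method itself lives in the linked GitHub code; as an assumption about what it looks like, a typical implementation compares the arg-max of the logits with the arg-max of the one-hot labels, something like:

def _compute_accuracy(self, logits, labels):
    # fraction of examples whose predicted class matches the true class
    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
    return tf.reduce_mean(tf.cast(correct, tf.float32))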

Finally, a summary merge operation is created, which will gather up all the summary data ready for export to the TensorBoard file whenever it is executed:

self.merged = tf.summary.merge_all()

An initialization operation is also created. Now all that is left is to run the main training loop.
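
The initialization operation is presumably the standard global-variables initializer; matching the model.init_op attribute used in the training loop below, it would look something like this (an assumption, not quoted from the original code):

self.init_op = tf.global_variables_initializer()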

Training the model

The main training loop of this experimental model is shown in the code below:

def run_training(model, mnist, sub_folder, iterations=2500, batch_size=30):
    with tf.Session() as sess:
        sess.run(model.init_op)
        train_writer = tf.summary.FileWriter(base_path + sub_folder,
                                             sess.graph)
        for i in range(iterations):
            image_batch, label_batch = mnist.train.next_batch(batch_size)
            l, _, acc = sess.run([model.loss, model.optimizer, model.accuracy],
                                 feed_dict={model.input_images: image_batch, model.labels: label_batch})
            if i % 200 == 0:
                summary = sess.run(model.merged, feed_dict={model.input_images: image_batch,
                                                            model.labels: label_batch})
                train_writer.add_summary(summary, i)
                print("Iteration {} of {}, loss: {:.3f}, train accuracy: "
                      "{:.2f}%".format(i, iterations, l, acc * 100))

This is a pretty standard TensorFlow training loop – however, one non-standard addition is the tf.summary.FileWriter() operation and its associated uses. This operation typically requires two arguments: the location to store the files and the session graph.

Note that it is a good idea to set up a separate subfolder for each of your TensorFlow runs when using summaries, as this enables better visualization and comparison of the various runs within TensorBoard.

Every 200 iterations, we run the merged operation, which is defined in the class instance model – as mentioned previously, this gathers up all the logged summary data ready for writing. The train_writer.add_summary() operation is then run on this output, which writes the data into the chosen location (optionally along with the iteration/epoch number).

The summary data can then be visualized using TensorBoard. To run TensorBoard, using the command prompt, navigate to the base directory where all the subfolders are stored, and run the following command:

tensorboard --logdir=whatever_your_folder_path_is

Upon running this command, you will see startup information in the prompt, telling you the address to type into your browser to bring up the TensorBoard interface. Note that the TensorBoard page updates itself dynamically during training, so you can visually monitor the progress.

Now, to run this whole experiment, we can run the following code, which cycles through each of the activation functions:

scenarios = ["sigmoid", "relu", "leaky_relu"]
act_funcs = [tf.sigmoid, tf.nn.relu, tf.nn.leaky_relu]
assert len(scenarios) == len(act_funcs)
# collect the training data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
for i in range(len(scenarios)):
    tf.reset_default_graph()
    print("Running scenario: {}".format(scenarios[i]))
    model = Model(784, 10, act_funcs[i], 6, 10)
    run_training(model, mnist, scenarios[i])

This should be pretty self-explanatory. Three scenarios are investigated, one for each type of activation: sigmoid, ReLU, and Leaky ReLU. Note that, in this experiment, I’ve set up a densely connected model with 6 layers (including the output layer but excluding the input layer), each with 10 nodes.

Solutions to the Vanishing Gradient Problem

Several strategies have been developed to mitigate the vanishing gradient problem, enabling the training of much deeper networks; a short sketch after the list shows how a few of them could be wired into the tf.layers model used earlier:

  • ReLU Activation Function:
    • ReLU returns the input value if it is positive and zero if it is negative.
    • Unlike sigmoid and tanh, it doesn’t saturate in the positive domain, ensuring gradients don’t shrink to zero.
    • Downside: The ‘dying ReLU’ problem, where neurons can get stuck and stop learning when the input is negative.
  • Leaky ReLU and Parametric ReLU:
    • These variants solve the ‘dying ReLU’ issue by allowing small, non-zero gradients for negative inputs.
    • Helps maintain neuron activity, especially in deep networks.
  • Weight Initialization Techniques:
    • Xavier Initialization: Scales the initial weight variance according to the number of input and output units (variance proportional to 1/(fan_in + fan_out)), which keeps gradients from shrinking or growing too quickly.
    • He Initialization: A variant tuned for ReLU networks, using a larger weight variance (proportional to 2/fan_in) to compensate for the units that ReLU switches off.
  • Batch Normalization:
    • Normalizes the inputs of each layer, stabilizing the distribution of activations during training.
    • Helps maintain a consistent range for activations, improving gradient flow and mitigating vanishing gradients.
  • Gradient Clipping:
    • Caps the magnitude of gradients during backpropagation, primarily to prevent them from exploding.
    • Keeps training numerically stable, complementing the techniques above that target vanishing gradients directly.
  • Skip Connections and Residual Networks (ResNets):
    • Skip connections allow gradients to bypass one or more layers, preventing them from becoming too small.
    • Especially useful for training very deep networks with hundreds or thousands of layers, helping to alleviate vanishing gradients.
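
As a concrete illustration of the first few items, here is one way the mitigations could be wired into the dense-layer loop from the experiment above. This is an assumed adaptation, not the article's original code; in particular, the self.is_training placeholder used for batch normalization is hypothetical.

# Assumed adaptation of the earlier dense-layer loop: He-style initialization,
# batch normalization, and a Leaky ReLU activation.
he_init = tf.variance_scaling_initializer(scale=2.0)   # He initialization, suited to ReLU-family activations

input = self.input_images
for i in range(self._num_layers - 1):
    input = tf.layers.dense(input, self._hidden_size,
                            kernel_initializer=he_init,
                            name='layer{}'.format(i + 1))
    # normalize the pre-activations, then apply a non-saturating activation
    input = tf.layers.batch_normalization(input, training=self.is_training)  # is_training: hypothetical placeholder
    input = tf.nn.leaky_relu(input)

# Note: tf.layers.batch_normalization keeps moving averages that must be updated -
# run the ops in tf.get_collection(tf.GraphKeys.UPDATE_OPS) alongside the training op.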

Conclusion

The vanishing gradient problem is a significant barrier to training deep neural networks, particularly those built with classical activation functions such as sigmoid or tanh. As networks become deeper, the gradients can shrink towards zero, the weights stop being updated correctly, and model performance suffers.

However, the vanishing gradient problem has been largely mitigated by newer activation functions such as ReLU, better weight initialization methods, and batch normalization. Furthermore, architectures such as ResNets use skip connections, which allow gradients to flow through the network more smoothly and have made it possible to train far deeper networks.

If you want to learn deep learning, it’s essential to understand the vanishing gradient problem and its solutions. As these techniques advance, neural networks can become deeper and more complex, enabling them to solve increasingly difficult problems across many fields.
