Overfitting is a serious problem in neural networks. To understand Gaussian Dropout, we must first understand what overfitting means.
Contributed by: Ribha Sharma
What is overfitting?
Overfitting occurs when a model classifies or predicts data in the training set well but performs poorly on a test data set. In other words, the model has learned the features of the training set very well, but if new data deviates even slightly from the training data, the model is unable to predict the output accurately.
Overfitting in neural networks is handled by various regularization techniques. Some well-known regularization techniques are:
- Parameter Norm Penalties
- L2 Parameter Regularization
- L1 Regularization
- Dataset Augmentation
- Noise Robustness
- Semi-Supervised Learning
- Multi-Task Learning
- Early Stopping of Training
- Adversarial Training
This article first explains standard dropout and then dives into Gaussian dropout.
What is Dropout?
Dropout is an inexpensive technique used to reduce overfitting. During training, it randomly ignores a subset of nodes in a given layer, or, put differently, it drops those nodes out of the layer; hence the name “Dropout”. The dropped nodes cannot participate in producing a prediction on that training pass. This helps the model generalize better to data it has never seen before and prevents the units from co-adapting too much.
Which nodes are dropped in each layer is completely random. Each node is retained with a fixed probability p, independent of the other nodes, where p can be chosen using a validation set or simply set to 0.5, a value that works well for a wide range of networks.
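The per-node Bernoulli retention described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a production implementation; the function name and shapes are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_dropout(x, p=0.5):
    """Standard (Bernoulli) dropout at training time.

    Each activation is kept with probability p and zeroed otherwise;
    the keep/drop decisions are independent across nodes, and p = 0.5
    is the common default mentioned above.
    """
    mask = rng.random(x.shape) < p  # True with probability p for each node
    return x * mask

activations = np.ones((4, 3))
dropped = bernoulli_dropout(activations, p=0.5)
# Each entry of `dropped` is either the original activation or 0.
```

At test time this version would require scaling by p, which is exactly the bookkeeping Gaussian dropout avoids, as discussed below.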
Which dropout method to apply depends on the type of network. The most common methods are Standard Dropout, DropConnect, Standout, Gaussian Dropout, Pooling Dropout, Spatial Dropout, Cutout, Max-Drop, RNNDrop, Recurrent Dropout, Variational RNN Dropout, and many more.
[Figure: proposed methods and theoretical advances in dropout, 2012–2019.] This article will emphasize Gaussian dropout.
In standard dropout, nodes are dropped during training, which significantly thins the network, and it is difficult to average the predictions from the exponentially many thinned models. So, instead of dropping nodes during training, each node's activation is multiplied by random noise injected at training time, thereby reducing the computational effort at test time. This noise can be either Bernoulli noise or Gaussian noise. When the injected noise is Gaussian, the method is called Gaussian Dropout.
Most dropout methods for DNNs are based on a Bernoulli gate, but some instead draw the gate from a Gaussian (normal) distribution.
With Gaussian dropout, the expected value of the activation remains unchanged. Therefore, unlike regular dropout, no weight scaling is required at test time. In this method, all nodes are exposed on every iteration for each training sample, so the slowdown during backpropagation is avoided and execution time is reduced.
Mathematically, dropout can be implemented in either of two equivalent ways:
- each node is retained with probability p at training time, and the weights are scaled down by a factor of p at test time;
- or the retained nodes are scaled up by a factor of 1/p at training time, and the weights are left unmodified at test time (often called "inverted" dropout).
Here p is the retention probability of standard (Bernoulli) dropout.
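The equivalence of these two scaling schemes can be checked numerically. The sketch below assumes p is the retention probability; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5  # retention probability

x = rng.normal(size=(1000, 10))
mask = (rng.random(x.shape) < p).astype(x.dtype)

# Scheme 1: drop at training time, scale activations by p at test time.
train_plain = x * mask
test_plain = x * p

# Scheme 2 ("inverted" dropout): scale retained nodes by 1/p at
# training time, leave everything untouched at test time.
train_inverted = x * mask / p
test_inverted = x

# In both schemes the expected activation matches the test-time value:
# E[mask] = p, so E[x * mask] = p * x and E[x * mask / p] = x.
```

Inverted dropout is the more common choice in practice because the test-time forward pass needs no special-casing.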
In Gaussian dropout, the Bernoulli gate is replaced by a Gaussian gate: each node's activation is multiplied by a random variable drawn from N(1, σ²), where σ² = (1−p)/p matches the mean and variance of inverted Bernoulli dropout with retention probability p. Among random variables with a given mean and variance, the Gaussian has the highest entropy and the Bernoulli the lowest, and Srivastava et al.'s experiments suggest that the highest-entropy choice works best.