- Activation Functions: Introduction
- Properties of activation functions
- Types of Activation Functions
- Binary Step Function
- Linear Activation Function
- Non-Linear Activation Functions
- Conclusion

*Contributed by: Sreekanth LinkedIn Profile: https://www.linkedin.com/in/sreekanth-tadakaluru-3301649b/ *

**Introduction**

Activation functions are mathematical equations that determine the output of each neuron, and therefore of the neural network model. They also have a major effect on the network’s ability to converge and on its convergence speed; in some cases, a poorly chosen activation function can prevent the network from converging at all. Some activation functions also normalize the output into a fixed range, such as -1 to 1 or 0 to 1.

An activation function must also be computationally efficient, because neural networks are sometimes trained on millions of data points and the function is evaluated for every one of them.

Let’s consider the simple neural network model without any hidden layers.

Its output is:

**Y = ∑ (weights * input) + bias**

This value can range from -infinity to +infinity, so it is often necessary to bound the output to get the desired prediction or generalized results:

**Y = Activation function(∑ (weights * input) + bias)**

So the activation function is an important part of an artificial neural network. It decides whether a neuron should be activated or not, and it is typically a non-linear transformation applied to the input before it is sent to the next layer of neurons or used as the final output.
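As a minimal sketch of the idea above, the snippet below computes the weighted sum for a single neuron and passes it through an activation. The function names and the example numbers are illustrative assumptions, and the logistic sigmoid is used only as one possible bounded activation.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical example values: the weighted sum is 2.0*0.5 + 1.0*(-1.0) + 0.0 = 0.0
y = neuron(inputs=[0.5, -1.0], weights=[2.0, 1.0], bias=0.0)
print(y)
```

Without the activation, the same neuron would return the raw weighted sum, which is unbounded.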

**Properties of activation functions**

- Non Linearity
- Continuously differentiable
- Range
- Monotonic
- Approximates identity near the origin

**Types of Activation Functions**

Activation functions can be broadly classified into 3 categories.

- Binary Step Function
- Linear Activation Function
- Non-Linear Activation Functions

**Binary Step Function**

A binary step function is generally used in the Perceptron linear classifier. It thresholds the input values to 1 and 0, if they are greater or less than zero, respectively.

The step function is mainly used in binary classification problems and works well for linearly separable problems. It can’t handle multi-class problems.
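A small sketch of a step-activated Perceptron follows. The weights are hand-picked (not learned) to implement logical AND, a linearly separable problem, and returning 0 when the input is exactly zero is an assumption, since the text leaves that case open.

```python
def binary_step(z):
    """Return 1 when z is greater than zero, otherwise 0 (z == 0 maps to 0 here)."""
    return 1 if z > 0 else 0

def perceptron(inputs, weights, bias):
    """Single Perceptron unit: weighted sum followed by the binary step."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return binary_step(z)

# Hand-picked, hypothetical weights that implement logical AND
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [perceptron([a, b], and_weights, and_bias)
           for a in (0, 1) for b in (0, 1)]
print(outputs)
```

Only the input (1, 1) pushes the weighted sum above zero, so it is the only case classified as 1.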


**Linear Activation Function**

The equation for Linear activation function is:

**f(x) = a·x**

When a = 1 then f(x) = x and this is a special case known as identity.

**Properties:**

- Range is -infinity to +infinity
- Provides a convex error surface so optimisation can be achieved faster
- df(x)/dx = a, which is constant, so the activation gives gradient descent no input-dependent signal to learn from

**Limitations:**

- Since the derivative is constant, the gradient has no relation to the input
- The value passed back during backpropagation is the same constant for every input, and any stack of linear layers collapses into a single linear layer
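The limitation of linear activations can be illustrated with a toy example: two stacked one-unit layers with linear activations behave exactly like one linear unit. The weights and slopes below are arbitrary, hypothetical values chosen for the sketch.

```python
def linear(a):
    """Build a linear activation f(x) = a * x."""
    return lambda x: a * x

f = linear(2.0)   # activation of the first layer
g = linear(3.0)   # activation of the second layer

def two_layers(x, w1=0.5, w2=4.0):
    """Two one-unit layers, each followed by a linear activation."""
    return g(w2 * f(w1 * x))

def one_layer(x, w=3.0 * 4.0 * 2.0 * 0.5):
    """A single linear unit whose weight is the product of all the factors."""
    return w * x

for x in (-2.0, 0.0, 1.5):
    print(x, two_layers(x), one_layer(x))
```

However many linear layers are stacked, the result stays a single multiplication, which is why depth adds nothing without a non-linear activation.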

**Non-Linear Activation Functions**

Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, such as images, video, audio, and data sets that are non-linear or have high dimensionality.

There are 3 major types of non-linear activation functions.

- Sigmoid Activation Functions
- Rectified Linear Units or ReLU
- Complex Nonlinear Activation Functions

**Sigmoid Activation Functions**

Sigmoid functions are bounded, differentiable, real functions that are defined for all real input values, and have a non-negative derivative at each point.

**Sigmoid or Logistic Activation Function**

The sigmoid function is a logistic function, and its output ranges between 0 and 1.

The output of the activation function is always in the range (0, 1), compared to (-inf, inf) for the linear function. It is non-linear, continuously differentiable, monotonic, and has a fixed output range. But it is not zero-centred.
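For reference, the standard logistic formula is sigmoid(x) = 1 / (1 + e^{-x}), and its derivative can be written as sigmoid(x) · (1 - sigmoid(x)). The sketch below is one way to implement both; splitting on the sign of x is an implementation choice made here to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)              # safe: x < 0, so e lies in [0, 1)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_grad(0.0))
```

The derivative peaks at x = 0 and shrinks toward zero on both sides, which matters for the gradient problems discussed later.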

**Hyperbolic Tangent**

The function produces outputs in the range (-1, 1) and it is a continuous function. In other words, the function produces an output for every x value.

**Y = tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})**
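The sketch below evaluates tanh directly from its exponential definition and also through the identity tanh(x) = 2 · sigmoid(2x) - 1, which shows that tanh is a rescaled, zero-centred sigmoid. The function names are illustrative; in practice `math.tanh` is the simpler choice.

```python
import math

def tanh_from_definition(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x); fine for moderate |x|."""
    ex, enx = math.exp(x), math.exp(-x)
    return (ex - enx) / (ex + enx)

def tanh_from_sigmoid(x):
    """tanh(x) expressed as 2 * sigmoid(2x) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in (-2.0, 0.0, 1.0):
    print(x, tanh_from_definition(x), tanh_from_sigmoid(x), math.tanh(x))
```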

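**Inverse Tangent (arctan)**

It is similar to sigmoid and tanh, but its output lies in the range (-pi/2, pi/2). Note that this range belongs to the inverse tangent, arctan(x); the inverse hyperbolic tangent, arctanh(x), is only defined for inputs between -1 and 1 and is unbounded.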

**Softmax**

The softmax function is sometimes called the soft argmax function, or multi-class logistic regression. This is because the softmax is a generalization of logistic regression that can be used for multi-class classification, and its formula is very similar to the sigmoid function which is used for logistic regression. The softmax function can be used in a classifier only when the classes are mutually exclusive.
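A minimal softmax sketch follows, using the usual definition softmax(z)_i = e^{z_i} / ∑_j e^{z_j}. Subtracting the maximum score before exponentiating is a common numerical-stability trick assumed here, and the scores are made-up example values.

```python
import math

def softmax(scores):
    """Turn a list of real scores into probabilities that sum to 1."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])              # hypothetical class scores
print(probs, sum(probs))

# With two classes and one score fixed at 0, softmax reduces to the sigmoid
z = 1.3
print(softmax([z, 0.0])[0], 1.0 / (1.0 + math.exp(-z)))
```

Because the outputs sum to 1, raising the probability of one class necessarily lowers the others, which is why softmax suits mutually exclusive classes.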

**Gudermannian**

The Gudermannian function relates circular functions and hyperbolic functions without explicitly using complex numbers.

The mathematical equation for the Gudermannian function is:

**gd(x) = 2 arctan(tanh(x/2))**
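As a hedged sketch, the Gudermannian function can be computed with the standard closed form gd(x) = 2·arctan(tanh(x/2)), which is equivalent to arctan(sinh(x)). The function name `gudermannian` is chosen here for illustration.

```python
import math

def gudermannian(x):
    """gd(x) = 2 * atan(tanh(x / 2)); output stays within [-pi/2, pi/2]."""
    return 2.0 * math.atan(math.tanh(x / 2.0))

for x in (-3.0, 0.0, 0.5, 2.0):
    # The second column uses the equivalent form atan(sinh(x))
    print(x, gudermannian(x), math.atan(math.sinh(x)))
```

Like sigmoid and tanh, it is a smooth, monotonic, S-shaped curve with a bounded output.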

**GELU (Gaussian Error Linear Units)**

GELU is an activation function used in Transformer models such as Google’s BERT and OpenAI’s GPT-2. It takes the form of this equation:

**GELU(x) = 0.5x(1 + tanh(√(2/π)(x + 0.044715x^{3})))**

So it’s just a combination of some functions (e.g. hyperbolic tangent tanh) and approximated numbers.

For large positive x the output approaches x, and for large negative x it approaches 0. In between, the curve bends smoothly: for small positive x (roughly from x=0 to x=1) the output is noticeably smaller than x, and for small negative x it dips slightly below zero.
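The sketch below compares the tanh-based approximation of GELU with the exact form 0.5·x·(1 + erf(x/√2)), which is GELU written with the Gaussian cumulative distribution. The function names are illustrative assumptions.

```python
import math

def gelu_tanh(x):
    """Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF, via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(x, gelu_tanh(x), gelu_exact(x))
```

The two versions stay close over this range, and both show the small negative dip for small negative x.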


**Problems with Sigmoid Activation Functions**

**1. Vanishing Gradients Problem**

The main problem with deep neural networks is that the gradient diminishes dramatically as it is propagated backward through the network. The error may be so small by the time it reaches layers close to the input of the model that it may have very little effect. As such, this problem is referred to as the “vanishing gradients” problem.

A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training step. Since these initial layers are often crucial to recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole network.
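To make the effect concrete, the toy sketch below chains single sigmoid units and applies the chain rule to get the gradient of the last output with respect to the first input. The chain of one-unit layers and the weight of 1.0 are simplifying assumptions; the sigmoid derivative is at most 0.25, so the product shrinks quickly with depth.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_through_chain(x, num_layers, weight=1.0):
    """d(output)/d(input) for stacked units a = sigmoid(weight * a_prev)."""
    grad, a = 1.0, x
    for _ in range(num_layers):
        s = sigmoid(weight * a)
        grad *= weight * s * (1.0 - s)   # chain rule: this layer's local derivative
        a = s
    return grad

for depth in (1, 5, 10, 20):
    print(depth, gradient_through_chain(0.5, depth))
```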

**2. Exploding Gradients**

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training. These large updates in turn result in an unstable network. At an extreme, the values of weights can become so large that they overflow and result in NaN values.
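The opposite effect can be sketched the same way. In the toy chain below, each layer is assumed to multiply the gradient by a weight larger than 1; with enough layers the value overflows to infinity, and arithmetic on infinities then produces NaN.

```python
import math

def gradient_magnitude(weight, num_layers):
    """Gradient through num_layers linear units that all share one weight."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= weight               # each layer scales the gradient by its weight
    return grad

small = gradient_magnitude(1.5, 10)    # grows, but stays finite
huge = gradient_magnitude(1.5, 2000)   # overflows to inf
print(small, huge, huge - huge)        # inf - inf evaluates to nan
```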

**Rectified Linear Units or ReLU**

The sigmoid and hyperbolic tangent activation functions are difficult to use in networks with many layers because of the vanishing gradient problem. The rectified linear activation function largely overcomes this problem, allowing models to learn faster and perform better. It is the default activation when developing multilayer Perceptrons and convolutional neural networks.

**Rectified Linear Unit (ReLU)**

ReLU is the most commonly used activation function in neural networks. The mathematical equation for ReLU is:

**ReLU(x) = max(0,x)**

So if the input is negative, the output of ReLU is 0 and for positive values, it is x.

Though it looks like a linear function, it’s not: it is piecewise linear, with a bend at zero. Its derivative is 0 for negative inputs and 1 for positive inputs, so it allows for backpropagation.

There is one problem with ReLU. Suppose most of the inputs to a neuron are negative or 0: ReLU then outputs 0, its gradient is also 0, and backpropagation can no longer update that neuron’s weights. This is called the Dying ReLU problem. Also, ReLU is an unbounded function, which means there is no maximum value.
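The following sketch shows ReLU, its derivative, and what the dying ReLU situation looks like for a neuron whose pre-activations are all negative. The input values are made up, and treating the derivative at exactly 0 as 0 is a convention assumed here.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is taken to be 0 by convention."""
    return 1.0 if x > 0 else 0.0

pre_activations = [-3.0, -0.5, -1.2]   # hypothetical inputs to one neuron
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]
print(outputs, grads)
```

Every output and every gradient is zero, so no error signal reaches this neuron's weights.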

Pros:

- Less time and space complexity
- Avoids the vanishing gradient problem.

Cons:

- Introduces the dying ReLU problem.
- Does not avoid the exploding gradient problem.

**Leaky ReLU**

The dying ReLU problem is likely to occur when:

- Learning rate is too high
- There is a large negative bias

Leaky ReLU is a common and effective method to solve the dying ReLU problem. It adds a slight slope in the negative range, so the gradient there is small but never exactly zero.

Again, this doesn’t solve the exploding gradient problem.
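A minimal Leaky ReLU sketch is shown below. The negative slope of 0.01 is a commonly used default and is an assumption here, not a value given in this article.

```python
def leaky_relu(x, negative_slope=0.01):
    """x for positive inputs, negative_slope * x otherwise."""
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.01):
    """Gradient is 1 on the positive side and negative_slope elsewhere."""
    return 1.0 if x > 0 else negative_slope

for z in (-2.0, -0.5, 0.0, 3.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Because the gradient for negative inputs is small but non-zero, a neuron stuck in the negative range can still receive weight updates.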

**Parametric ReLU**

PReLU is not so different from Leaky ReLU. The difference is that the negative slope, alpha, is learned during training instead of being fixed in advance.

So for negative values of x, the output of PReLU is alpha times x and for positive values, it is x.

Like Leaky ReLU, Parametric ReLU addresses the dying ReLU problem, but again it doesn’t solve the exploding gradient problem.
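In PReLU the negative slope alpha is a trainable parameter. The sketch below shows the activation, its gradient with respect to alpha, and one gradient-descent step on a squared error. The starting alpha, target, and learning rate are hypothetical values picked for the illustration.

```python
def prelu(x, alpha):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x):
    """d prelu / d alpha: x on the non-positive side, 0 on the positive side."""
    return x if x <= 0 else 0.0

# One gradient-descent step on loss = 0.5 * (prelu(x, alpha) - target)^2
x, target = -2.0, -1.0
alpha, learning_rate = 0.25, 0.1
error = prelu(x, alpha) - target                 # -0.5 - (-1.0) = 0.5
new_alpha = alpha - learning_rate * error * prelu_grad_alpha(x)
print(alpha, new_alpha, prelu(x, new_alpha))
```

After the step, the output for the negative input moves closer to the target, which a fixed slope could not do.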

**Exponential Linear Unit (ELU)**

ELU can speed up learning in neural networks and lead to higher classification accuracies, and it helps with the vanishing gradient problem. ELUs have improved learning characteristics compared to several other activation functions. ELUs have negative values, which allow them to push mean unit activations closer to zero, like batch normalization but with lower computational complexity.

The mathematical expression for ELU is:

**ELU(x) = x if x > 0, and α(e^{x} - 1) if x ≤ 0**

ELU is designed to combine the good parts of ReLU and leaky ReLU, and it doesn’t have the dying ReLU problem. It saturates for large negative values, allowing those units to be essentially inactive.
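A short ELU sketch follows, using the usual definition: x for positive inputs and α(e^{x} - 1) otherwise. The default α = 1.0 is a common choice and an assumption here; `math.expm1` computes e^{x} - 1 accurately for small x.

```python
import math

def elu(x, alpha=1.0):
    """x for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

for z in (-50.0, -1.0, 0.0, 2.0):
    print(z, elu(z))
```

For large negative inputs the output flattens out near -alpha, which is the saturation described above.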

**Scaled Exponential Linear Unit (SELU)**

SELU incorporates normalization based on the central limit theorem. SELU is a monotonically increasing function with an approximately constant negative output for large negative inputs. SELUs are most commonly used in Self-Normalizing Networks (SNNs).

The output of a SELU layer is self-normalizing, which could be called internal normalization: under the conditions described by its authors, the activations are driven toward a mean of zero and a standard deviation of one. The main advantage claimed for SELU is that vanishing and exploding gradients become impossible. Since it is a relatively new activation function, it requires more testing before usage.
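SELU is usually written as scale · ELU(x, alpha) with two fixed constants. The sketch below uses the widely published values of those constants and checks the self-normalizing behaviour empirically on standard normal samples. The sample size and seed are arbitrary choices for this illustration.

```python
import math
import random

# Widely published SELU constants (rounded to double precision)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """scale * x for positive inputs, scale * alpha * (e^x - 1) otherwise."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * math.expm1(x))

# Feed standard normal inputs through SELU and measure the output statistics
random.seed(0)
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)
```

For inputs with zero mean and unit variance, the outputs keep roughly the same statistics, which is the fixed point that self-normalizing networks rely on.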

**Softplus or SmoothReLU**

Softplus is a smooth approximation of ReLU, and its derivative is the logistic function.

The mathematical expression is:

**f(x) = ln(1 + e^{x})**

And the derivative of softplus is:

**f'(x) = 1 / (1 + e^{-x})**
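The sketch below implements softplus, ln(1 + e^{x}), and checks numerically that its derivative matches the logistic function. The rewritten form max(x, 0) + log1p(e^{-|x|}) is an equivalent expression assumed here for numerical stability, and the finite-difference helper is illustrative.

```python
import math

def softplus(x):
    """ln(1 + e^x), rewritten so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Logistic function, split on the sign of x to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, 0.0, 2.0):
    print(x, numerical_derivative(softplus, x), sigmoid(x))
```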

**Swish function**

The Swish function was developed by Google, and it is reported to give superior performance with a similar level of computational efficiency to the ReLU function. ReLU still plays an important role in deep learning studies today, but experiments show that this newer activation function outperforms ReLU for deeper networks.

The mathematical expression for Swish Function is:

**f(x) = x · sigmoid(x)**

The modified version of swish function is:

**f(x) = x · sigmoid(βx)**

Here, β is a parameter that must be tuned. If β gets closer to ∞, then the function looks like ReLU. Authors of the Swish function proposed to assign β as 1 for reinforcement learning tasks.
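A minimal sketch of Swish with a β parameter is shown below, assuming the form x · sigmoid(βx). The β values used are arbitrary illustrations; a large β is included to show the ReLU-like limit.

```python
import math

def sigmoid(x):
    """Logistic function, split on the sign of x to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

for z in (-2.0, 0.0, 1.0, 2.0):
    # beta = 1 gives the basic form; a large beta behaves almost like ReLU
    print(z, swish(z), swish(z, beta=50.0))
```

With β = 50 the outputs for -2 and 2 are approximately 0 and 2, matching what ReLU would return.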

**Conclusion**

In this blog, we explained the common non-linear activation functions along with their mathematical expressions.
