The Vanishing Gradient Problem Demystified

Dive deeper into the vanishing gradient problem and its solutions

Muhammad Iqbal bazmi
5 min read · Jul 12, 2020

An Artificial Neural Network is a collection of neurons arranged in a horizontal stack of layers: an input layer, one or more hidden layers, and an output layer. Each layer is a collection of neurons, and the neurons in one layer are connected to the neurons in the next layer in some fashion. See the pictures below of a basic neuron and an artificial neural network.

[Image: Single neuron (biological and artificial)]
[Image: Artificial neural network]

In the first image, you can see the biological and the artificial neuron, which is the building block of an Artificial Neural Network. Each single neuron learns a small part of the overall mapping.

In an artificial neuron, the input features are multiplied by weights (weights describe the importance of each connection/feature), summed up, and passed through an activation function to produce the output.
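To make this concrete, here is a rough sketch of a single neuron in NumPy; the weights, bias, and the choice of a sigmoid activation are just illustrative.

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of the inputs plus a bias, passed through the activation
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])   # input features (illustrative values)
w = np.array([0.4, 0.1, -0.6])   # weights: importance of each feature
b = 0.1                          # bias
print(neuron(x, w, b))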

The second picture shows a network of such artificial neurons, which is called an Artificial Neural Network (ANN).

One of the most essential and difficult parts of working with neural networks is training them.

We send the input features through the input layer; they are propagated through the hidden layers, with some computation at each layer, until they reach the output layer. Then we compute the loss from y_actual and y_predicted (for example, Mean Squared Error (MSE)).
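For example, the Mean Squared Error between the actual and predicted values can be computed like this (the numbers are made up for illustration):

import numpy as np

def mse(y_actual, y_predicted):
    # average of the squared differences between targets and predictions
    return np.mean((y_actual - y_predicted) ** 2)

y_actual = np.array([1.0, 0.0, 1.0])
y_predicted = np.array([0.8, 0.2, 0.6])
print(mse(y_actual, y_predicted))  # 0.08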

Now, the main concern is to reduce the loss, maximize the accuracy of the model, and generalize the network to reduce overfitting. How do we do that?

Yeah, using Back-Propagation.

Backpropagation is nothing but updating the weights of the neural connections using the derivative of the loss function with respect to every weight in the network (∂L/∂w).

In other words, it measures the impact that changing the weights, even those in the input layer, has on the loss.

We find the derivative of the loss with respect to every weight and update that weight accordingly.
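As a minimal sketch of that update (a toy one-weight model with a made-up learning rate, neither of which appears in the original post):

# toy example: one weight w, squared-error loss L = (y - w * x) ** 2
x, y = 2.0, 1.0
w = 0.1
learning_rate = 0.1   # illustrative value

for step in range(3):
    y_pred = w * x
    dL_dw = 2.0 * (y_pred - y) * x      # derivative of the loss w.r.t. the weight
    w = w - learning_rate * dL_dw       # move the weight against its gradient
    print(step, round(w, 4))            # w moves toward 0.5, where the loss is 0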

The problem arises when we compute these derivatives and they turn out to be less than one.

We use the chain rule to find the derivative with respect to each weight. The catch is that the chain rule multiplies derivatives together, so if the per-layer derivatives are small, their product is even smaller, and the weights in the earlier layers barely change. In other words, changes to the earliest layers end up having almost no effect on the loss.

This situation, where the gradients become vanishingly small, is called the Vanishing Gradient problem. It means very slow learning and difficulty converging to a (local) minimum, which is why deep neural networks are so hard to train this way.
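Here is a rough numerical illustration of that multiplication effect, assuming each layer contributes a derivative of at most 0.25 (the sigmoid's maximum slope, discussed below):

# chained product of per-layer derivatives, each at most 0.25
derivative_per_layer = 0.25
for depth in (2, 5, 10, 20):
    print(depth, derivative_per_layer ** depth)
# 2 -> 0.0625, 5 -> ~0.001, 10 -> ~9.5e-07, 20 -> ~9.1e-13: the gradient vanishes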

There are several causes of Vanishing Gradients and Exploding Gradients.

1. Weight Initialization:

Weight initialization has a huge impact on the gradients of the loss. If the weights are very small, they lead to the Vanishing Gradient problem; if they are very large, they lead to the Exploding Gradient problem.

A drastic increase in the gradients is called an Exploding Gradient; it leads to an unstable network, and learning never completes.

The weights may even overflow to values like NaN.
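Standard remedies (not discussed further in this post) are scaled initialization schemes such as Xavier/Glorot and He initialization. Here is a rough NumPy sketch with illustrative layer sizes:

import numpy as np

fan_in, fan_out = 256, 128   # illustrative layer sizes

# Xavier/Glorot initialization (one common form), often paired with sigmoid/tanh
w_xavier = np.random.randn(fan_out, fan_in) * np.sqrt(1.0 / fan_in)

# He initialization, often paired with ReLU
w_he = np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

print(w_xavier.std(), w_he.std())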

2. Activation function:

There are 3 popular Activation functions.

  • Sigmoid
  • tanh
  • ReLU(Rectified Linear Unit)

2.1 Sigmoid:

Sigmoid takes an input and squashes it into the range (0, 1), which frequently leads to vanishing gradients.

You can see that when the input to the sigmoid is very small or very large, the gradient becomes very close to zero, which makes the gradient vanish.
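You can verify this with the sigmoid's derivative, sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), whose maximum is 0.25 at z = 0:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid_grad(z))
# maximum of 0.25 at z = 0; at |z| = 10 the gradient is ~4.5e-05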

2.2 tanh:

tanh also squashes its input, but into the range (-1, 1). So it sometimes performs better than sigmoid, but it still does not overcome the vanishing gradient problem.
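The same quick check for tanh, whose derivative is 1 - tanh(z)^2, shows a peak of 1 at z = 0 but a rapid decay for large inputs:

import numpy as np

def tanh_grad(z):
    # derivative of tanh: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

for z in (-5.0, -2.0, 0.0, 2.0, 5.0):
    print(z, tanh_grad(z))
# 1.0 at z = 0, but ~0.07 at |z| = 2 and ~0.00018 at |z| = 5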

Note: The Vanishing Gradient was a main reason for the failure of neural networks in the 1990s and 2000s. Scientists were not able to train deeper networks.

So scientists came up with a new idea called ReLU (Rectified Linear Unit).

2.3 ReLU(Rectified Linear Unit):

ReLU is much less prone to the vanishing gradient problem. How?

It's because the slope (gradient) of the ReLU activation is exactly 1 whenever its input is greater than 0. Sigmoid has a maximum slope of 0.25, and tanh has a maximum slope of 1 only at zero and is below 1 everywhere else, which means that during backpropagation you keep multiplying by values less than 1, making the gradient smaller and smaller.

ReLU overcomes this because its slope of 1 means that, during backpropagation, the gradients passed back are not progressively shrunk; they stay the same size, which is how ReLU mitigates the vanishing gradient problem.

One thing to note about ReLU, however, is that for inputs less than 0 the neuron is "dead": its gradient is 0, so during backpropagation no gradient is passed back through it.

An alternative is Leaky ReLU, which keeps a small, non-zero gradient for inputs less than 0.
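Here is a minimal sketch of ReLU and Leaky ReLU with their gradients (the leak factor 0.01 is just a common illustrative choice):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # gradient is 1 for positive inputs, 0 otherwise ("dead" region)
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # a small slope alpha keeps some gradient flowing for negative inputs
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z), relu_grad(z))              # gradient is 0 for the negative inputs
print(leaky_relu(z), leaky_relu_grad(z))  # gradient is 0.01 for the negative inputs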

3. Batch Normalization:

Batch Normalization also makes the model more robust against vanishing gradient issues by keeping the activations of each layer in a well-behaved range.
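As a rough sketch of what Batch Normalization does in the forward pass (normalizing each feature over the mini-batch and then rescaling with learnable gamma and beta parameters; the shapes here are illustrative):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations, shape (batch_size, num_features)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift

x = np.random.randn(32, 4) * 10 + 5           # poorly scaled activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))      # ~0 mean and ~1 std per feature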

I hope this helps you understand the Vanishing Gradient problem.

For any query, please drop a comment below. Thanks!

I will upload a detailed video soon… See you in the video.

