
When you decide to learn deep learning, it is a good idea to start with logistic regression, because each neuron in a deep neural network can be thought of as a logistic regression unit (provided every layer uses the sigmoid activation function). The operation in each neuron is a dot product between the input vector and the weight vector, plus a bias term, followed by an activation function, which is exactly what logistic regression does. So what is the difference between logistic regression and a neural network?


The simple answer is:

  1. You can think of logistic regression as a single-layer neural network with a sigmoid activation function.
  2. Say you build a 5-layer neural network in which every layer uses the sigmoid activation function; then each neuron in that network is a logistic regression unit.
  3. Logistic regression has no parallelism: it is a single neuron in a single layer.
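The single-neuron view above can be sketched in a few lines of NumPy (the function names here are my own, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(x, w, b):
    # A single "neuron": dot product, plus bias, then sigmoid activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # input vector
w = np.zeros(3)                  # weights (here: zero-initialized)
b = 0.0                          # bias
print(logistic_regression(x, w, b))  # sigmoid(0) = 0.5
```

A deep sigmoid network simply stacks many of these units, layer after layer.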

Zero Initialization

The simplest way to initialize the weights for logistic regression is to set all the weights and the bias to zero, so why don't we do the same for neural networks?


Suppose we use ReLU in the hidden layers and sigmoid in the output layer. Initializing the weights and biases to zero will surely lead us to the dead neuron problem. For more about dead neurons, please refer to Neural Network: The Dead Neuron.

In each iteration of gradient descent, we compute the gradient for the previous layer by multiplying the current weights by the gradient backpropagated from the next layer. If the initial weights are zero, multiplying them by any gradient yields zero. With a zero gradient, gradient descent never changes the weights, so each iteration has no effect on the parameters we are trying to optimize. If we instead set the bias to a positive value, for example one, the weights do change because the ReLU neurons produce non-zero outputs, but they change in an undesirable way.


Notice that every neuron in the same layer behaves identically and will end up with the same weights. This phenomenon is called the symmetry problem.
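A minimal sketch makes the symmetry problem concrete. Assuming a tiny 4-2-1 sigmoid network of my own invention initialized with constant weights: no matter how many gradient steps we take, the two hidden neurons remain clones of each other.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))    # a single sample with 4 features
y = np.array([[1.0]])

# Constant initialization: both hidden neurons start out identical.
W1 = np.ones((2, 4)); b1 = np.zeros((2, 1))
W2 = np.ones((1, 2)); b2 = np.zeros((1, 1))

for _ in range(100):
    h = sigmoid(W1 @ x + b1)            # hidden layer (2 neurons)
    out = sigmoid(W2 @ h + b2)          # output neuron
    d_out = out - y                     # grad of cross-entropy + sigmoid
    dW2 = d_out @ h.T
    d_h = (W2.T @ d_out) * h * (1 - h)  # backprop into hidden layer
    dW1 = d_h @ x.T
    W2 -= 0.1 * dW2                     # (bias updates omitted for brevity)
    W1 -= 0.1 * dW1

# After 100 updates the two hidden neurons still have identical weights.
print(np.allclose(W1[0], W1[1]))  # True
```

Because both neurons receive the same inputs and start with the same weights, they compute the same output, receive the same gradient, and take the same update, forever.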

Constant Initialization

The same problem occurs if the weights are initialized with a constant value, for example all weights set to one and all biases set to zero.


As you can imagine, all the weights being equal is a bad thing, because it means every neuron in the same layer represents the same feature. Hence, adding more neurons to a layer does not increase the expressiveness of the network; such a layer behaves as if it had only one neuron. The solution is quite simple: randomize the initial weights. For example, we draw the weights at random and set the biases to zero.


Random initialization breaks this symmetry and lets every neuron in the network behave differently.

Now we know that we have to initialize the weights randomly, but is that enough to build a very deep neural network? Unfortunately not. It turns out that plain random initialization is not enough: with arbitrarily drawn numbers we may run into either the vanishing gradient or the exploding gradient problem. Large weights lead to exploding gradients, while small weights lead to vanishing gradients. Moreover, every neuron will have a different output range, since we assign the weights without constraining them to a specific range, and that is bad for a neural network.
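We can watch this happen by pushing a signal through a deep stack of randomly initialized layers. This is a toy experiment of my own, assuming tanh activations, 50 layers, and width 500:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_std(weight_scale, n_layers=50, width=500):
    """Std of the activations after pushing a signal through n_layers."""
    a = rng.normal(size=width)  # input signal with std ~ 1
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_scale, size=(width, width))
        a = np.tanh(W @ a)
    return a.std()

print(forward_std(0.01))  # tiny weights: the signal all but vanishes
print(forward_std(1.0))   # large weights: tanh saturates near +/-1
```

With a scale of 0.01 the activations shrink by a constant factor per layer and end up vanishingly small; with a scale of 1.0 the pre-activations blow up and the tanh units saturate, which kills their gradients.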

Random Normal

To control the range, we can initialize the weights from a normal distribution. Instead of using arbitrary random numbers, drawing from a normal distribution lets us control the spread of the weights by choosing the mean and standard deviation. For example, a mean of zero and a standard deviation of one will mostly give us numbers between -1 and 1.
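As a quick sketch (the layer shape here is arbitrary), drawing a weight matrix from N(0, 1) with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
# Draw a weight matrix from N(0, 1): about 68% of the values fall
# within one standard deviation of the mean, i.e. between -1 and 1.
W = rng.normal(loc=0.0, scale=1.0, size=(256, 128))
print(W.mean(), W.std())  # close to 0 and 1
```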

Xavier Initialization

A better technique for initializing a neural network is to control the variance of its outputs: we want the outputs produced by the neurons of every layer to follow the same distribution. Xavier initialization chooses the weights precisely so that this holds.

We know that the output of a neuron with a linear activation function is:

y = w1*x1 + w2*x2 + ... + wN*xN + b

Remember that the goal of Xavier initialization is to make the output of a neuron follow the same distribution as its input; this helps keep the gradient from exploding or vanishing. In other words, what we want is:

var(y) = var(x)

Remember that b is a constant with zero variance, so we can drop it. Computing the variance of each term wi*xi on the right-hand side, we get:

var(wi*xi) = E[wi]^2 * var(xi) + E[xi]^2 * var(wi) + var(wi) * var(xi)

Here E denotes the expected value (mean). Assuming the inputs are normalized and the weights come from a normal distribution with mean 0, the E terms vanish and we get:

var(wi*xi) = var(wi) * var(xi)

Substituting this back into the original formula, we get:

var(y) = var(w1)*var(x1) + var(w2)*var(x2) + ... + var(wN)*var(xN)

Since the weights (and inputs) are identically distributed, we can simplify the above equation to:

var(y) = N * var(w) * var(x)

Here N is the dimension of the input vector x. In order to satisfy var(y) = var(x), we need N * var(w) to equal one:

N * var(w) = 1

Solving for var(w):

var(w) = 1 / N

Now we can initialize the weights by drawing random numbers from a normal distribution with mean zero and standard deviation sqrt(1/N). In code:

W = np.random.randn(n_out, N) * np.sqrt(1.0 / N)  # N = fan-in
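We can check empirically that this keeps the variance stable. In this sketch (the sizes are arbitrary), a Xavier-initialized linear layer leaves the variance of a batch of normalized inputs roughly unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512                             # fan-in: dimension of x
x = rng.normal(size=(N, 2000))      # 2000 normalized input vectors

W = rng.normal(size=(N, N)) * np.sqrt(1.0 / N)  # Xavier initialization
y = W @ x                           # linear activation, zero bias

print(x.var(), y.var())  # both close to 1
```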

Another version of this formula uses both fan-in and fan-out. The fan-in is the number of inputs coming into a layer, while the fan-out is the number of outputs leaving it. The formula becomes:

var(w) = 2 / (fan_in + fan_out)

In code:

W = np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / (fan_in + fan_out))

Conclusion

The points to take away from this article:

  1. Don't initialize the network with zero weights; in some cases the network won't learn at all.
  2. Don't initialize the network with constant weights; you will end up with a symmetric network.
  3. You can break the symmetry by initializing the weights randomly and keeping the biases constant. But that alone is not enough: if the weights are too small you run into the vanishing gradient problem, and if they are too large you run into the exploding gradient problem. Instead, initialize the weights within a specific range.
  4. A better technique is Xavier initialization, which keeps the variance the same across every layer.

References

[1] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning (2016).
[2] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks (2010).