One Hidden Layer NN

We will build a shallow dense neural network with one hidden layer; the following structure is used for illustration purposes.

Before trying to understand this post, I strongly suggest you go through my previous implementation of logistic regression, as logistic regression can be seen as a 1-layer neural network and the basic concept is essentially the same.

[Figure: a dense network with input layer (x_1, x_2), one hidden layer (a_1, a_2, a_3, a_4), and a single output y_1]

In the graph above, we have an input vector x = (x_1, x_2) containing 2 features, 4 hidden units a_1, a_2, a_3 and a_4, and a single output value y_1 in [0, 1] (consider this a binary classification task where we predict a probability).

In each hidden unit, taking a_1 as an example, a linear operation followed by an activation function is applied. So given the input x = (x_1, x_2), inside node a_1 we have:

$$z_1 = w_{11} x_1 + w_{12} x_2 + b_1, \qquad a_1 = g(z_1)$$

Here w_{11} denotes weight 1 of node 1, w_{12} denotes weight 2 of node 1, and g is the activation function. Similarly, node a_2 would have:

$$z_2 = w_{21} x_1 + w_{22} x_2 + b_2, \qquad a_2 = g(z_2)$$

And the same goes for a_3 and a_4, and so on …

Vectorization of One Input

Now let's put the weights into a matrix and the inputs into a vector to simplify the expressions.

$$z^{[1]} = W^{[1]} x + b^{[1]} \tag{1}$$

$$a^{[1]} = \tanh\left(z^{[1]}\right) \tag{2}$$

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \tag{3}$$

$$\hat{y} = a^{[2]} = \sigma\left(z^{[2]}\right) \tag{4}$$

Here we have assumed the hidden-layer activation function to be tanh and the output activation function to be sigmoid (note that a superscript [i] denotes the i-th layer).

For the dimension of each matrix, we have:

$$W^{[1]} \in \mathbb{R}^{4 \times 2}, \quad b^{[1]} \in \mathbb{R}^{4 \times 1}, \quad W^{[2]} \in \mathbb{R}^{1 \times 4}, \quad b^{[2]} \in \mathbb{R}^{1 \times 1}$$

$$x \in \mathbb{R}^{2 \times 1}, \quad z^{[1]}, a^{[1]} \in \mathbb{R}^{4 \times 1}, \quad z^{[2]}, a^{[2]} \in \mathbb{R}^{1 \times 1}$$

The loss function L for a single sample is the same as that of logistic regression (details introduced here).
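For reference, that single-sample binary cross-entropy loss is:

$$L(\hat{y}, y) = -\left[ y \log \hat{y} + (1 - y) \log\left(1 - \hat{y}\right) \right]$$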

The tanh and sigmoid functions look as below.

[Figure: plots of the tanh and sigmoid functions]

Notice that the only difference between these two functions is the scale of the output: tanh ranges over (−1, 1) while sigmoid ranges over (0, 1).
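For reference, the two functions are defined as:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \in (-1, 1), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \in (0, 1)$$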

Formula of Batch Training

The above shows the formulas for a single input vector; however, in actual training, a batch is processed instead of one sample at a time. The change to the formulas is trivial: we just replace the single vector x with a matrix X of size n x m, where n is the number of features and m is the batch size (samples are stacked column-wise), and the resulting matrices follow likewise.

$$Z^{[1]} = W^{[1]} X + b^{[1]} \tag{5}$$

$$A^{[1]} = \tanh\left(Z^{[1]}\right) \tag{6}$$

$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \tag{7}$$

$$A^{[2]} = \sigma\left(Z^{[2]}\right) \tag{8}$$

For the dimensions of each matrix in this example, we have:

$$X \in \mathbb{R}^{2 \times m}, \quad Z^{[1]}, A^{[1]} \in \mathbb{R}^{4 \times m}, \quad Z^{[2]}, A^{[2]} \in \mathbb{R}^{1 \times m}$$

The weight and bias shapes are unchanged; b^[1] and b^[2] are broadcast across the m columns.

Same as logistic regression, for batch training we take the average loss over all m training samples:

$$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log a^{[2](i)} + \left(1 - y^{(i)}\right) \log\left(1 - a^{[2](i)}\right) \right] \tag{9}$$

This is all for the forward propagation. To enable our network to learn, we need to compute the derivatives of the weight parameters and update them using gradient descent.
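Concretely, once we have the gradients, each layer's parameters are updated with the standard gradient descent rule, using a learning rate α:

$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$$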

But for now, it is enough to implement the forward propagation first.

Generate Sample Dataset

Here we generate a simple binary classification task with 5000 data points and 20 features for later model validation.
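A minimal sketch of how such a dataset could be generated, assuming scikit-learn's make_classification (the exact generator in the repo may differ). The samples are transposed so they are stacked column-wise, matching the formulas above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 5000 samples, 20 features, binary labels
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Transpose so that samples are stacked column-wise: shape (n_features, m)
X_train, X_test = X_train.T, X_test.T
Y_train, Y_test = y_train.reshape(1, -1), y_test.reshape(1, -1)
```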

Weights Initialization

Our neural network has 1 hidden layer and 2 layers in total (hidden layer + output layer), so there are 4 parameter arrays to initialize: W^[1], b^[1], W^[2] and b^[2]. Notice that the weights are initialized with relatively small random values, which keeps the activations in the non-saturated region of tanh and sigmoid where the gradients are larger, so learning is faster in the beginning phase.
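A minimal sketch of the initialization, assuming layer sizes (n_x, n_h, n_y) and a small scaling factor of 0.01 (the exact factor and naming in the repo may differ):

```python
import numpy as np

def init_params(n_x, n_h, n_y, scale=0.01, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,  # (n_h, n_x)
        "b1": np.zeros((n_h, 1)),                       # (n_h, 1)
        "W2": rng.standard_normal((n_y, n_h)) * scale,  # (n_y, n_h)
        "b2": np.zeros((n_y, 1)),                       # (n_y, 1)
    }
```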

Forward Propagation

Let's implement the forward process following equations (5) ~ (8).
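A sketch of the forward pass following equations (5) ~ (8); the parameter names (W1, b1, W2, b2) follow the initialization sketch above and are my own naming, not necessarily the repo's:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    Z1 = params["W1"] @ X + params["b1"]   # equation (5)
    A1 = np.tanh(Z1)                       # equation (6)
    Z2 = params["W2"] @ A1 + params["b2"]  # equation (7)
    A2 = sigmoid(Z2)                       # equation (8)
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache
```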

Loss Function

Following equation (9), the loss of each batch can be calculated as follows.
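A sketch of equation (9); the small epsilon guarding against log(0) is my own addition:

```python
def compute_loss(A2, Y, eps=1e-8):
    m = Y.shape[1]
    return float(
        -np.sum(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps)) / m
    )
```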

Back Propagation

Now it comes to the back propagation, which is the key to updating our weights. Given the loss function L we defined above, we have the following gradients:

$$dZ^{[2]} = A^{[2]} - Y$$

$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]\,T}$$

$$db^{[2]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)}$$

$$dZ^{[1]} = W^{[2]\,T} dZ^{[2]} \odot \left(1 - \left(A^{[1]}\right)^2\right)$$

$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}$$

$$db^{[1]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)}$$

If you are confused about why the derivative with respect to Z^[2] takes this form, you can check here. In fact, the last layer of our network is the same as logistic regression, so the derivative is inherited from there.

In the equation for dZ^[1], ⊙ denotes element-wise multiplication, and since the derivative of tanh(z) is 1 − tanh²(z), the term (1 − (A^[1])²) appears. You can try to derive the equations above yourself, but I basically took them from the internet.

Let's break down the shape of each element, given that the layer sizes are (n_x, n_h, n_y) and the batch size is m:

$$W^{[1]}, dW^{[1]} \in \mathbb{R}^{n_h \times n_x}, \quad b^{[1]}, db^{[1]} \in \mathbb{R}^{n_h \times 1}, \quad W^{[2]}, dW^{[2]} \in \mathbb{R}^{n_y \times n_h}, \quad b^{[2]}, db^{[2]} \in \mathbb{R}^{n_y \times 1}$$

$$X \in \mathbb{R}^{n_x \times m}, \quad Z^{[1]}, A^{[1]}, dZ^{[1]} \in \mathbb{R}^{n_h \times m}, \quad Z^{[2]}, A^{[2]}, dZ^{[2]}, Y \in \mathbb{R}^{n_y \times m}$$

Once we understand the formulas, the implementation should come with ease.
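A sketch of the gradient equations above, reusing the cache from the forward sketch; again, the variable names are my own:

```python
def backward(X, Y, params, cache):
    m = X.shape[1]
    A1, A2 = cache["A1"], cache["A2"]
    dZ2 = A2 - Y                                   # (n_y, m)
    dW2 = dZ2 @ A1.T / m                           # (n_y, n_h)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # (n_y, 1)
    dZ1 = (params["W2"].T @ dZ2) * (1 - A1 ** 2)   # (n_h, m)
    dW1 = dZ1 @ X.T / m                            # (n_h, n_x)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (n_h, 1)
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```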

Batch Training

I have wrapped each part into a class so that it can be trained like a typical Python package. In addition, mini-batch training is also implemented. To avoid redundancy, I didn't include it here; for the detailed implementation, please check my git repo.
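The full class with mini-batch support lives in the repo; below is only a minimal full-batch sketch of how the pieces above could fit together (the learning rate and epoch count are arbitrary choices here, not the repo's settings):

```python
def train(X, Y, n_h=10, lr=0.1, epochs=1000):
    params = init_params(X.shape[0], n_h, Y.shape[0])
    for epoch in range(epochs):
        A2, cache = forward(X, params)
        grads = backward(X, Y, params, cache)
        for name in ("W1", "b1", "W2", "b2"):
            params[name] -= lr * grads["d" + name]  # gradient descent step
        if epoch % 100 == 0:
            print(epoch, compute_loss(A2, Y))
    return params
```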

Let's see how our implemented NN performs on our dataset.
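As a rough illustration of how the sketches above could be wired together (using the hypothetical X_train/X_test split from the dataset sketch and a 0.5 probability threshold):

```python
def predict(X, params, threshold=0.5):
    A2, _ = forward(X, params)
    return (A2 > threshold).astype(int)

params = train(X_train, Y_train, n_h=10)
accuracy = float(np.mean(predict(X_test, params) == Y_test))
print(f"test accuracy: {accuracy:.3f}")
```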


With 10 hidden neurons, our model is able to achieve 95.1% accuracy on the test set, which is pretty good.

Now go ahead and try implementing it yourself; the process will really help you gain a deeper understanding of general dense neural networks.