Deriving Backpropagation for Neural Networks

For the training process of a neural network, the backpropagation algorithm lies at its core. Based on the deviation between the predicted value $\hat{y}$ and the actual value $y$ , the network computes the gradients of the loss function with respect to each parameter from back to front, and then uses gradient descent to optimize and train the network’s parameters.

The computation graph of the neural network is shown below:

Computation graph for neural network backpropagation — The computation graph of the neural network: showing the gradient computation paths of each parameter in forward and backward propagation

From this computation graph, we can see that if we want to compute the network’s parameters $W^{[1]},b^{[1]},W^{[2]},b^{[2]}$ , we first need to compute $\frac{\partial L}{\partial a^{[2]}}$ and $\frac{\partial a^{[2]}}{\partial z^{[2]}}$ , and then use the chain rule to obtain $\frac{\partial L}{\partial z^{[2]}}=\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial z^{[2]}}$ .

Next we compute $\frac{\partial z^{[2]}}{\partial W^{[2]}}$ and $\frac{\partial z^{[2]}}{\partial b^{[2]}}$ . Again, by the chain rule, we obtain $\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial W^{[2]}}$ as well as $\frac{\partial L}{\partial b^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial b^{[2]}}$ . This gives us $dW^{[2]}$ and $db^{[2]}$ .

In addition, to compute $dW^{[1]}$ and $db^{[1]}$ , we first need to compute $\frac{\partial z^{[1]}}{\partial W^{[1]}}$ , $\frac{\partial a^{[1]}}{\partial z^{[1]}}$ and $\frac{\partial z^{[2]}}{\partial a^{[1]}}$ . Again, by the chain rule, we obtain $\frac{\partial L}{\partial W^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial W^{[1]}}$ , as well as $\frac{\partial L}{\partial b^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial b^{[1]}}$ . This likewise gives us $dW^{[1]}$ and $db^{[1]}$ .

When using stochastic gradient descent (SGD) as the optimization algorithm together with the cross-entropy loss function, we let $a^{[2]}=\hat{y}$ , that is, the loss function: $L(\hat{y},y)=-(ylog\hat{y}+(1-y)log(1-\hat{y}))$

Using the sigmoid activation function, that is, $a^{[1]}=\sigma(z^{[1]})=\frac{1}{1+e^{-z^{[1]}}}\\a^{[2]}=\sigma(z^{[2]})=\frac{1}{1+e^{-z^{[2]}}}$

Substituting this activation function and loss function into the computation above, we obtain:

dz^{[2]}=a^{[2]}-y\\ dW^{[2]}=dz^{[2]}a^{[1]T}\\ db^{[2]}=dz^{[2]}\\ dz^{[1]}=W^{[2]T}dz^{[2]}*\sigma^{'}(z^{[1]})\\ dW^{[1]}=dz^{[1]}x^{T}\\ db^{[1]}=dz^{[1]}

During stochastic gradient descent, we randomly select a misclassified point from the samples, compute the current $dW^{[1]},db^{[1]},dW^{[2]},db^{[2]}$ based on that point, and then use the following formulas to update $W^{[1]},b^{[1]},W^{[2]},b^{[2]}$ :

W^{[2]}:=W^{[2]}-\alpha *dW^{[2]}\\ b^{[2]}:=b^{[2]}-\alpha *db^{[2]}\\ W^{[1]}:=W^{[1]}-\alpha *dW^{[1]}\\ b^{[1]}:=b^{[1]}-\alpha *db^{[1]}

until convergence.

For training neural networks, there are also methods such as batch gradient descent, mini-batch gradient descent, stochastic gradient descent with momentum, RMSProp, and Adam, which I will explain in detail later.

To be continued…

Technology

2018 · 06 · 01