Derivation of Neural Network Backpropagation


The backpropagation algorithm is the core of the training process of a neural network. Based on the deviation between the predicted value $\hat{y}$ and the actual value $y$, the network computes the gradient of the loss function with respect to each parameter from back to front, which allows gradient descent to be used to optimize the network's parameters.

The computation flowchart of the neural network is as follows:

(Figure: computation graph of the two-layer network, from the input $x$ through $z^{[1]}, a^{[1]}, z^{[2]}, a^{[2]}$ to the loss $L$.)

From this flowchart we can see that, to compute the gradients with respect to the parameters $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$, we first need to calculate $\frac{\partial L}{\partial a^{[2]}}$ and $\frac{\partial a^{[2]}}{\partial z^{[2]}}$. Then, using the chain rule, we get $\frac{\partial L}{\partial z^{[2]}}=\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial z^{[2]}}$.
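
For reference, and consistent with the formulas derived below, the quantities in the flowchart are assumed to follow the standard two-layer feedforward computation (with the sigmoid activation $\sigma$ specified later):

$$
z^{[1]}=W^{[1]}x+b^{[1]},\qquad a^{[1]}=\sigma(z^{[1]}),\qquad
z^{[2]}=W^{[2]}a^{[1]}+b^{[2]},\qquad a^{[2]}=\sigma(z^{[2]})=\hat{y},\qquad
L=L(\hat{y},y)
$$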

Next, we compute $\frac{\partial z^{[2]}}{\partial W^{[2]}}$ and $\frac{\partial z^{[2]}}{\partial b^{[2]}}$. Similarly, using the chain rule, we obtain $\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial W^{[2]}}$ and $\frac{\partial L}{\partial b^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial b^{[2]}}$. This gives us $dW^{[2]}$ and $db^{[2]}$.
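
Since $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$ is affine in $W^{[2]}$ and $b^{[2]}$ (under the forward pass assumed above), these two local derivatives take a simple form in the vectorized shorthand used throughout:

$$
\frac{\partial z^{[2]}}{\partial W^{[2]}}=a^{[1]T},\qquad
\frac{\partial z^{[2]}}{\partial b^{[2]}}=1
$$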

To calculate $dW^{[1]}$ and $db^{[1]}$, we first need to compute $\frac{\partial z^{[1]}}{\partial W^{[1]}}$, $\frac{\partial a^{[1]}}{\partial z^{[1]}}$, and $\frac{\partial z^{[2]}}{\partial a^{[1]}}$. Again using the chain rule, we get $\frac{\partial L}{\partial W^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial W^{[1]}}$ and $\frac{\partial L}{\partial b^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial b^{[1]}}$. This also gives us $dW^{[1]}$ and $db^{[1]}$.
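
Under the same assumptions, the remaining local derivatives are (again in vectorized shorthand, with $\sigma'$ applied element-wise):

$$
\frac{\partial z^{[2]}}{\partial a^{[1]}}=W^{[2]T},\qquad
\frac{\partial a^{[1]}}{\partial z^{[1]}}=\sigma'(z^{[1]}),\qquad
\frac{\partial z^{[1]}}{\partial W^{[1]}}=x^{T},\qquad
\frac{\partial z^{[1]}}{\partial b^{[1]}}=1
$$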

When using the stochastic gradient descent (SGD) optimization algorithm with the cross-entropy loss, we write $a^{[2]}=\hat{y}$, so the loss function is:

$$
L(\hat{y},y)=-\bigl(y\log\hat{y}+(1-y)\log(1-\hat{y})\bigr)
$$
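
Differentiating this loss with respect to $a^{[2]}=\hat{y}$ gives the first factor needed in the chain rule above:

$$
\frac{\partial L}{\partial a^{[2]}}=-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}
$$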

Using the sigmoid activation function, we have:

$$
a^{[1]}=\sigma(z^{[1]})=\frac{1}{1+e^{-z^{[1]}}},\qquad
a^{[2]}=\sigma(z^{[2]})=\frac{1}{1+e^{-z^{[2]}}}
$$
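
A key property of the sigmoid is that its derivative can be written in terms of its output, $\sigma'(z)=\sigma(z)\bigl(1-\sigma(z)\bigr)$, so $\frac{\partial a^{[2]}}{\partial z^{[2]}}=a^{[2]}(1-a^{[2]})$. Combining this with $\frac{\partial L}{\partial a^{[2]}}$ above explains the compact form of $dz^{[2]}$ below:

$$
dz^{[2]}=\left(-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}\right)a^{[2]}\bigl(1-a^{[2]}\bigr)=a^{[2]}-y
$$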

Substituting this activation function and loss function into the above computation process, we get:

$$
\begin{aligned}
dz^{[2]} &= a^{[2]}-y \\
dW^{[2]} &= dz^{[2]}\,a^{[1]T} \\
db^{[2]} &= dz^{[2]} \\
dz^{[1]} &= W^{[2]T}dz^{[2]} * \sigma'(z^{[1]}) \\
dW^{[1]} &= dz^{[1]}\,x^{T} \\
db^{[1]} &= dz^{[1]}
\end{aligned}
$$

where $*$ denotes the element-wise product.
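
As a sanity check on these formulas, here is a minimal NumPy sketch of the forward and backward pass for a single training example. The shapes, the variable names, and the `forward_backward` helper are illustrative assumptions, not code from the original derivation.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward/backward pass for a single example.

    Assumed shapes: x (n_x, 1), W1 (n_h, n_x), b1 (n_h, 1),
    W2 (1, n_h), b2 (1, 1); y is a scalar label in {0, 1}.
    """
    # Forward pass
    z1 = W1 @ x + b1            # (n_h, 1)
    a1 = sigmoid(z1)            # (n_h, 1)
    z2 = W2 @ a1 + b2           # (1, 1)
    a2 = sigmoid(z2)            # (1, 1), i.e. y_hat
    loss = (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()

    # Backward pass, following the formulas above
    dz2 = a2 - y                          # dz2 = a2 - y
    dW2 = dz2 @ a1.T                      # dW2 = dz2 a1^T
    db2 = dz2                             # db2 = dz2
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)    # sigma'(z1) = a1 * (1 - a1), element-wise
    dW1 = dz1 @ x.T                       # dW1 = dz1 x^T
    db1 = dz1                             # db1 = dz1
    return loss, dW1, db1, dW2, db2
```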

During stochastic gradient descent, one training sample is selected at random; based on this sample, the current $dW^{[1]}, db^{[1]}, dW^{[2]}, db^{[2]}$ are calculated, and the following formulas are then used to update $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$:

$$
\begin{aligned}
W^{[2]} &:= W^{[2]}-\alpha\, dW^{[2]} \\
b^{[2]} &:= b^{[2]}-\alpha\, db^{[2]} \\
W^{[1]} &:= W^{[1]}-\alpha\, dW^{[1]} \\
b^{[1]} &:= b^{[1]}-\alpha\, db^{[1]}
\end{aligned}
$$

until convergence.
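
A correspondingly minimal SGD loop might look like the sketch below; it reuses the illustrative `forward_backward` helper from the previous sketch, picks one example at random per step, and applies exactly the updates above (a fixed number of steps stands in for "until convergence").

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

def sgd_train(X, Y, W1, b1, W2, b2, alpha=0.1, num_steps=10000):
    """X: (n_x, m) training inputs, Y: (1, m) binary labels, alpha: learning rate."""
    m = X.shape[1]
    for _ in range(num_steps):
        i = rng.integers(m)          # randomly select one training example
        x = X[:, i:i + 1]            # keep the column-vector shape (n_x, 1)
        y = Y[0, i]
        _, dW1, db1, dW2, db2 = forward_backward(x, y, W1, b1, W2, b2)
        # Gradient-descent updates: theta := theta - alpha * d_theta
        W2 -= alpha * dW2
        b2 -= alpha * db2
        W1 -= alpha * dW1
        b1 -= alpha * db1
    return W1, b1, W2, b2
```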

Besides SGD, neural networks can also be trained with Batch Gradient Descent, Mini-Batch Gradient Descent, Momentum-based Stochastic Gradient Descent, RMSProp, Adam, and other methods, which will be covered in detail later.

To be continued…