Derivation of Neural Network Backpropagation


The backpropagation algorithm is the core of the training process of a neural network. Based on the deviation between the predicted value $\hat{y}$ and the actual value $y$, the network computes the gradient of the loss function with respect to each parameter from back to front, which allows gradient descent to be used to optimize the network's parameters.

The computation flowchart of the neural network is as follows:

(Figure: computation graph of the two-layer network, from the input $x$ through $z^{[1]}, a^{[1]}, z^{[2]}, a^{[2]}$ to the loss $L$.)

From this flowchart we can see that, to compute the gradients with respect to the parameters $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$, we first need to calculate $\frac{\partial L}{\partial a^{[2]}}$ and $\frac{\partial a^{[2]}}{\partial z^{[2]}}$. Then, using the chain rule, we get $\frac{\partial L}{\partial z^{[2]}}=\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial z^{[2]}}$.
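
For reference, and consistent with the formulas derived below, the quantities in the flowchart are assumed to follow the standard two-layer feedforward computation (with the sigmoid activation $\sigma$ specified later):

$$
z^{[1]}=W^{[1]}x+b^{[1]},\qquad a^{[1]}=\sigma(z^{[1]}),\qquad
z^{[2]}=W^{[2]}a^{[1]}+b^{[2]},\qquad a^{[2]}=\sigma(z^{[2]})=\hat{y},\qquad
L=L(\hat{y},y)
$$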

Next, we compute $\frac{\partial z^{[2]}}{\partial W^{[2]}}$ and $\frac{\partial z^{[2]}}{\partial b^{[2]}}$. Similarly, using the chain rule, we obtain $\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial W^{[2]}}$ and $\frac{\partial L}{\partial b^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial b^{[2]}}$. This gives us $dW^{[2]}$ and $db^{[2]}$.
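
Since $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$ is affine in $W^{[2]}$ and $b^{[2]}$ (under the forward pass assumed above), these two local derivatives take a simple form in the vectorized shorthand used throughout:

$$
\frac{\partial z^{[2]}}{\partial W^{[2]}}=a^{[1]T},\qquad
\frac{\partial z^{[2]}}{\partial b^{[2]}}=1
$$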

To calculate $dW^{[1]}$ and $db^{[1]}$, we first need to compute $\frac{\partial z^{[1]}}{\partial W^{[1]}}$, $\frac{\partial a^{[1]}}{\partial z^{[1]}}$, and $\frac{\partial z^{[2]}}{\partial a^{[1]}}$. Again using the chain rule, we get $\frac{\partial L}{\partial W^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial W^{[1]}}$ and $\frac{\partial L}{\partial b^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial b^{[1]}}$. This also gives us $dW^{[1]}$ and $db^{[1]}$.
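
Under the same assumptions, the remaining local derivatives are (again in vectorized shorthand, with $\sigma'$ applied element-wise):

$$
\frac{\partial z^{[2]}}{\partial a^{[1]}}=W^{[2]T},\qquad
\frac{\partial a^{[1]}}{\partial z^{[1]}}=\sigma'(z^{[1]}),\qquad
\frac{\partial z^{[1]}}{\partial W^{[1]}}=x^{T},\qquad
\frac{\partial z^{[1]}}{\partial b^{[1]}}=1
$$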

When using the stochastic gradient descent (SGD) optimization algorithm with the cross-entropy loss, we write $a^{[2]}=\hat{y}$, so the loss function is:

$$
L(\hat{y},y)=-\bigl(y\log\hat{y}+(1-y)\log(1-\hat{y})\bigr)
$$
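
Differentiating this loss with respect to $a^{[2]}=\hat{y}$ gives the first factor needed in the chain rule above:

$$
\frac{\partial L}{\partial a^{[2]}}=-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}
$$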

Using the sigmoid activation function, we have:

$$
a^{[1]}=\sigma(z^{[1]})=\frac{1}{1+e^{-z^{[1]}}},\qquad
a^{[2]}=\sigma(z^{[2]})=\frac{1}{1+e^{-z^{[2]}}}
$$
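
A key property of the sigmoid is that its derivative can be written in terms of its output, $\sigma'(z)=\sigma(z)\bigl(1-\sigma(z)\bigr)$, so $\frac{\partial a^{[2]}}{\partial z^{[2]}}=a^{[2]}(1-a^{[2]})$. Combining this with $\frac{\partial L}{\partial a^{[2]}}$ above explains the compact form of $dz^{[2]}$ below:

$$
dz^{[2]}=\left(-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}\right)a^{[2]}\bigl(1-a^{[2]}\bigr)=a^{[2]}-y
$$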

Substituting this activation function and loss function into the above computation process, we get:

$$
\begin{aligned}
dz^{[2]} &= a^{[2]}-y \\
dW^{[2]} &= dz^{[2]}\,a^{[1]T} \\
db^{[2]} &= dz^{[2]} \\
dz^{[1]} &= W^{[2]T}dz^{[2]} * \sigma'(z^{[1]}) \\
dW^{[1]} &= dz^{[1]}\,x^{T} \\
db^{[1]} &= dz^{[1]}
\end{aligned}
$$

where $*$ denotes the element-wise product.
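
As a sanity check on these formulas, here is a minimal NumPy sketch of the forward and backward pass for a single training example. The shapes, the variable names, and the `forward_backward` helper are illustrative assumptions, not code from the original derivation.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward/backward pass for a single example.

    Assumed shapes: x (n_x, 1), W1 (n_h, n_x), b1 (n_h, 1),
    W2 (1, n_h), b2 (1, 1); y is a scalar label in {0, 1}.
    """
    # Forward pass
    z1 = W1 @ x + b1            # (n_h, 1)
    a1 = sigmoid(z1)            # (n_h, 1)
    z2 = W2 @ a1 + b2           # (1, 1)
    a2 = sigmoid(z2)            # (1, 1), i.e. y_hat
    loss = (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()

    # Backward pass, following the formulas above
    dz2 = a2 - y                          # dz2 = a2 - y
    dW2 = dz2 @ a1.T                      # dW2 = dz2 a1^T
    db2 = dz2                             # db2 = dz2
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)    # sigma'(z1) = a1 * (1 - a1), element-wise
    dW1 = dz1 @ x.T                       # dW1 = dz1 x^T
    db1 = dz1                             # db1 = dz1
    return loss, dW1, db1, dW2, db2
```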

During stochastic gradient descent, one training sample is selected at random; based on this sample, the current $dW^{[1]}, db^{[1]}, dW^{[2]}, db^{[2]}$ are calculated, and the following formulas are then used to update $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$:

$$
\begin{aligned}
W^{[2]} &:= W^{[2]}-\alpha\, dW^{[2]} \\
b^{[2]} &:= b^{[2]}-\alpha\, db^{[2]} \\
W^{[1]} &:= W^{[1]}-\alpha\, dW^{[1]} \\
b^{[1]} &:= b^{[1]}-\alpha\, db^{[1]}
\end{aligned}
$$

until convergence.
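
A correspondingly minimal SGD loop might look like the sketch below; it reuses the illustrative `forward_backward` helper from the previous sketch, picks one example at random per step, and applies exactly the updates above (a fixed number of steps stands in for "until convergence").

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

def sgd_train(X, Y, W1, b1, W2, b2, alpha=0.1, num_steps=10000):
    """X: (n_x, m) training inputs, Y: (1, m) binary labels, alpha: learning rate."""
    m = X.shape[1]
    for _ in range(num_steps):
        i = rng.integers(m)          # randomly select one training example
        x = X[:, i:i + 1]            # keep the column-vector shape (n_x, 1)
        y = Y[0, i]
        _, dW1, db1, dW2, db2 = forward_backward(x, y, W1, b1, W2, b2)
        # Gradient-descent updates: theta := theta - alpha * d_theta
        W2 -= alpha * dW2
        b2 -= alpha * db2
        W1 -= alpha * dW1
        b1 -= alpha * db1
    return W1, b1, W2, b2
```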

Besides SGD, neural networks can also be trained with Batch Gradient Descent, Mini-Batch Gradient Descent, Momentum-based Stochastic Gradient Descent, RMSProp, Adam, and other methods, which will be covered in detail later.

To be continued…