Deriving Backpropagation for Neural Networks

For the training process of a neural network, the backpropagation algorithm lies at its core. Based on the deviation between the predicted value y^\hat{y} and the actual value yy, the network computes the gradients of the loss function with respect to each parameter from back to front, and then uses gradient descent to optimize and train the network’s parameters.

The computation graph of the neural network is shown below:

Computation graph for neural network backpropagation
The computation graph of the neural network: showing the gradient computation paths of each parameter in forward and backward propagation

From this computation graph, we can see that if we want to compute the network’s parameters W[1],b[1],W[2],b[2]W^{[1]},b^{[1]},W^{[2]},b^{[2]}, we first need to compute La[2]\frac{\partial L}{\partial a^{[2]}} and a[2]z[2]\frac{\partial a^{[2]}}{\partial z^{[2]}}, and then use the chain rule to obtain Lz[2]=La[2]a[2]z[2]\frac{\partial L}{\partial z^{[2]}}=\frac{\partial L}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial z^{[2]}}.

Next we compute z[2]W[2]\frac{\partial z^{[2]}}{\partial W^{[2]}} and z[2]b[2]\frac{\partial z^{[2]}}{\partial b^{[2]}}. Again, by the chain rule, we obtain LW[2]=Lz[2]z[2]W[2]\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial W^{[2]}} as well as Lb[2]=Lz[2]z[2]b[2]\frac{\partial L}{\partial b^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial b^{[2]}}. This gives us dW[2]dW^{[2]} and db[2]db^{[2]}.

In addition, to compute dW[1]dW^{[1]} and db[1]db^{[1]}, we first need to compute z[1]W[1]\frac{\partial z^{[1]}}{\partial W^{[1]}}, a[1]z[1]\frac{\partial a^{[1]}}{\partial z^{[1]}} and z[2]a[1]\frac{\partial z^{[2]}}{\partial a^{[1]}}. Again, by the chain rule, we obtain LW[1]=Lz[2]z[2]a[1]a[1]z[1]z[1]W[1]\frac{\partial L}{\partial W^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial W^{[1]}}, as well as Lb[1]=Lz[2]z[2]a[1]a[1]z[1]z[1]b[1]\frac{\partial L}{\partial b^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial b^{[1]}}. This likewise gives us dW[1]dW^{[1]} and db[1]db^{[1]}.

When using stochastic gradient descent (SGD) as the optimization algorithm together with the cross-entropy loss function, we let a[2]=y^a^{[2]}=\hat{y}, that is, the loss function: L(y^,y)=(ylogy^+(1y)log(1y^))L(\hat{y},y)=-(ylog\hat{y}+(1-y)log(1-\hat{y}))

Using the sigmoid activation function, that is, a[1]=σ(z[1])=11+ez[1]a[2]=σ(z[2])=11+ez[2]a^{[1]}=\sigma(z^{[1]})=\frac{1}{1+e^{-z^{[1]}}}\\a^{[2]}=\sigma(z^{[2]})=\frac{1}{1+e^{-z^{[2]}}}

Substituting this activation function and loss function into the computation above, we obtain:

dz[2]=a[2]ydW[2]=dz[2]a[1]Tdb[2]=dz[2]dz[1]=W[2]Tdz[2]σ(z[1])dW[1]=dz[1]xTdb[1]=dz[1]dz^{[2]}=a^{[2]}-y\\ dW^{[2]}=dz^{[2]}a^{[1]T}\\ db^{[2]}=dz^{[2]}\\ dz^{[1]}=W^{[2]T}dz^{[2]}*\sigma^{'}(z^{[1]})\\ dW^{[1]}=dz^{[1]}x^{T}\\ db^{[1]}=dz^{[1]}

During stochastic gradient descent, we randomly select a misclassified point from the samples, compute the current dW[1],db[1],dW[2],db[2]dW^{[1]},db^{[1]},dW^{[2]},db^{[2]} based on that point, and then use the following formulas to update W[1],b[1],W[2],b[2]W^{[1]},b^{[1]},W^{[2]},b^{[2]}:

W[2]:=W[2]αdW[2]b[2]:=b[2]αdb[2]W[1]:=W[1]αdW[1]b[1]:=b[1]αdb[1]W^{[2]}:=W^{[2]}-\alpha *dW^{[2]}\\ b^{[2]}:=b^{[2]}-\alpha *db^{[2]}\\ W^{[1]}:=W^{[1]}-\alpha *dW^{[1]}\\ b^{[1]}:=b^{[1]}-\alpha *db^{[1]}

until convergence.

For training neural networks, there are also methods such as batch gradient descent, mini-batch gradient descent, stochastic gradient descent with momentum, RMSProp, and Adam, which I will explain in detail later.

To be continued…