Proof of Backpropagation


graph LR A(($a^1$)) -->|$\theta^1$| B(($z^2$)); B -->|$g$| C(($a^2$)); C -->|$\theta^2$| D(($z^3$)); D -->|$g$| E(($a^3$))

Forward Pass

Define $a_{0}$ as the bias, the subscript ij represents the direction from output nodes i: $a^{(n)}_{i}$ to input nodes j: $z^{(n+1)}_{j}$, the superscript represents the layer index

$$z^{(2)} = \theta^{(1)}a^{(1)} $$

For example,

$$\left[ \begin{array}{} z^{(2)}_{1} \\ z^{(2)}_{2} \\ \end{array} \right] = \left[ \begin{array}{} \theta^{(1)}_{01} & \theta^{(1)}_{11} & \theta^{(1)}_{21} \\ \theta^{(1)}_{02} & \theta^{(1)}_{12} & \theta^{(1)}_{22} \\ \end{array} \right] \space \left[ \begin{array}{} a^{(1)}_{0}\\ a^{(1)}_{1}\\ a^{(1)}_{2}\\ \end{array} \right] $$

Bias will be added back before next layer of matrix multiplication.

$$a^{(2)} = \left[ \begin{array}{} a^{(2)}_{1}\\ a^{(2)}_{2}\\ \end{array} \right] \to \left[ \begin{array}{} a^{(2)}_{0}\\ a^{(2)}_{1}\\ a^{(2)}_{2}\\ \end{array} \right]$$

Similarily for layer 3,

$$z^{(3)} = \theta^{(2)}a^{(2)} $$ $$\left[ \begin{array}{} z^{(3)}_{1} \\ z^{(3)}_{2} \\ \end{array} \right] = \left[ \begin{array}{} \theta^{(2)}_{01} & \theta^{(2)}_{11} & \theta^{(2)}_{21} \\ \theta^{(2)}_{02} & \theta^{(2)}_{12} & \theta^{(2)}_{22} \\ \end{array} \right] \space \left[ \begin{array}{} a^{(2)}_{0}\\ a^{(2)}_{1}\\ a^{(2)}_{2}\\ \end{array} \right] $$ $$a^{(3)} = g(z^{(3)}) = \left[ \begin{array}{} \frac{1}{1+ e^{-z^{(3)}_{1}}}\\ \frac{1}{1+ e^{-z^{(3)}_{2}}}\\ \end{array} \right] = \left[ \begin{array}{} a^{(3)}_{1}\\ a^{(3)}_{2}\\ \end{array} \right]$$

But since layer 3 is the last layer, no bias will be added back.


Cost function

$$\begin{align*} J(\theta) &= \Delta = \Delta_{1} + \Delta_{2}=\frac{1}{2}(\delta^{(3)}_{1})^{2}+\frac{1}{2}(\delta^{(3)}_{2})^{2}\\ &= \frac{1}{2}\big((a^{(3)}_{1} - y_{1})^2+(a^{(3)}_{2} - y_{2})^2\big) \end{align*}$$

Gradient Descent for each weight

$$\theta^{(k)}_{ij} := \theta^{(k)}_{ij} - \alpha \frac{\partial J(\theta)}{\partial \theta^{(k)}_{ij}}$$

Backward pass

For $\theta^{(2)}$, $$\frac{\partial \Delta}{\partial \theta^{(2)}_{ij}} = \frac{\mathrm d \Delta}{\mathrm d \Delta_{j}} \frac{\mathrm d \Delta_{j}}{\mathrm d \delta^{(3)}_{j}} \frac{\partial \delta^{(3)}_{j}}{\partial a^{(3)}_{j}} \frac{\mathrm d a^{(3)}_{j}}{\mathrm d z^{(3)}_{j}} \frac{\partial z^{(3)}_{j}}{\partial \theta^{(2)}_{ij}} = a^{(2)}_{i}g'(z^{(3)}_{j})\delta^{(3)}_{j}$$ $$\frac{\partial \Delta}{\partial \theta^{(2)}_{11}} = \frac{\mathrm d \Delta}{\mathrm d \Delta_{1}} \frac{\mathrm d \Delta_{1}}{\mathrm d \delta^{(3)}_{1}} \frac{\partial \delta^{(3)}_{1}}{\partial a^{(3)}_{1}} \frac{\mathrm d a^{(3)}_{1}}{\mathrm d z^{(3)}_{1}} \frac{\partial z^{(3)}_{1}}{\partial \theta^{(2)}_{11}} = (1)(\delta^{(3)}_{1})(1)\bigg(\frac{ e^{-z^{(3)}_{1}}}{(1+ e^{-z^{(3)}_{1}})^{2}}\bigg)a^{(2)}_{1} = (\delta^{(3)}_{1})\bigg(\frac{ e^{-z^{(3)}_{1}}}{(1+ e^{-z^{(3)}_{1}})^{2}}\bigg)a^{(2)}_{1} = a^{(2)}_{1}a^{(3)}_{1}(1-a^{(3)}_{1})\delta^{(3)}_{1} = a^{(2)}_{1}g'(z^{(3)}_{1})\delta^{(3)}_{1}$$ $$, where \space \space g'(z^{(3)}_{1}) = a^{(3)}_{1}(1 - a^{(3)}_{1});$$ $$\frac{\partial \Delta}{\partial \theta^{(2)}_{21}} = \frac{\mathrm d \Delta}{\mathrm d \Delta_{1}} \frac{\mathrm d \Delta_{1}}{\mathrm d \delta^{(3)}_{1}} \frac{\partial \delta^{(3)}_{1}}{\partial a^{(3)}_{1}} \frac{\mathrm d a^{(3)}_{1}}{\mathrm d z^{(3)}_{1}} \frac{\partial z^{(3)}_{1}}{\partial \theta^{(2)}_{21}} = a^{(2)}_{2}g'(z^{(3)}_{1})\delta^{(3)}_{1}; $$ $$\frac{\partial \Delta}{\partial \theta^{(2)}_{12}} = \frac{\mathrm d \Delta}{\mathrm d \Delta_{2}} \frac{\mathrm d \Delta_{2}}{\mathrm d \delta^{(3)}_{2}} \frac{\partial \delta^{(3)}_{2}}{\partial a^{(3)}_{2}} \frac{\mathrm d a^{(3)}_{2}}{\mathrm d z^{(3)}_{2}} \frac{\partial z^{(3)}_{2}}{\partial \theta^{(2)}_{12}} = a^{(2)}_{1}g'(z^{(3)}_{2})\delta^{(3)}_{2}; $$ $$\frac{\partial \Delta}{\partial \theta^{(2)}_{22}} = \frac{\mathrm d \Delta}{\mathrm d \Delta_{2}} \frac{\mathrm d \Delta_{2}}{\mathrm d \delta^{(3)}_{2}} \frac{\partial \delta^{(3)}_{2}}{\partial a^{(3)}_{2}} \frac{\mathrm d a^{(3)}_{2}}{\mathrm d z^{(3)}_{2}} \frac{\partial z^{(3)}_{2}}{\partial \theta^{(2)}_{22}} = a^{(2)}_{2}g'(z^{(3)}_{2})\delta^{(3)}_{2}$$

Generally,

$$\frac{\partial J(\theta)}{\partial \theta^{(\ell)}_{ij}} = a^{(\ell)}_{i}g'(z^{\ell + 1}_{j})\delta^{\ell + 1}_{j} $$

For $\theta^{(1)}$,

$$\frac{\partial \Delta}{\partial \theta^{(1)}_{ij}} = \bigg( \frac{\mathrm d \Delta}{\mathrm d \Delta_1}\frac{\mathrm d \Delta_1}{\mathrm d \delta^{(3)}_{1}}\frac{\partial \delta^{(3)}_{1}}{\partial a^{(3)}_{1}}\frac{\mathrm d a^{(3)}_{1}}{\mathrm d z^{(3)}_{1}} \frac{\partial z^{(3)}_{1}}{\partial a^{(2)}_{j}} + \frac{\mathrm d \Delta}{\mathrm d \Delta_2}\frac{\mathrm d \Delta_2}{\mathrm d \delta^{(3)}_{2}}\frac{\partial \delta^{(3)}_{2}}{\partial a^{(3)}_{2}}\frac{\mathrm d a^{(3)}_{2}}{\mathrm d z^{(3)}_{2}} \frac{\partial z^{(3)}_{2}}{\partial a^{(2)}_{j}} \bigg) \frac{\mathrm d a^{(2)}_{j}}{\mathrm d z^{(2)}_{j}} \frac{\partial z^{(2)}_{j}}{\partial \theta^{(1)}_{ij}} $$ $$\frac{\partial \Delta}{\partial \theta^{(1)}_{11}} = \bigg( \frac{\mathrm d \Delta}{\mathrm d \Delta_1}\frac{\mathrm d \Delta_1}{\mathrm d \delta^{(3)}_{1}}\frac{\partial \delta^{(3)}_{1}}{\partial a^{(3)}_{1}}\frac{\mathrm d a^{(3)}_{1}}{\mathrm d z^{(3)}_{1}} \frac{\partial z^{(3)}_{1}}{\partial a^{(2)}_{1}} + \frac{\mathrm d \Delta}{\mathrm d \Delta_2}\frac{\mathrm d \Delta_2}{\mathrm d \delta^{(3)}_{2}}\frac{\partial \delta^{(3)}_{2}}{\partial a^{(3)}_{2}}\frac{\mathrm d a^{(3)}_{2}}{\mathrm d z^{(3)}_{2}} \frac{\partial z^{(3)}_{2}}{\partial a^{(2)}_{1}} \bigg) \frac{\mathrm d a^{(2)}_{1}}{\mathrm d z^{(2)}_{1}} \frac{\partial z^{(2)}_{1}}{\partial \theta^{(1)}_{11}} = \big(\delta^{(3)}_{1}g'(z^{(3)}_{1})\theta^{(2)}_{11} + \delta^{(3)}_{2}g'(z^{(3)}_{2})\theta^{(2)}_{12}\big)g'(z^{(2)}_{1})a^{(1)}_1 = a^{(1)}_{1}g'(z^{(2)}_{1})\delta^{(2)}_{1}$$ $$\frac{\partial \Delta}{\partial \theta^{(1)}_{12}} = \bigg( \frac{\mathrm d \Delta}{\mathrm d \Delta_1}\frac{\mathrm d \Delta_1}{\mathrm d \delta^{(3)}_{1}}\frac{\partial \delta^{(3)}_{1}}{\partial a^{(3)}_{1}}\frac{\mathrm d a^{(3)}_{1}}{\mathrm d z^{(3)}_{1}} \frac{\partial z^{(3)}_{1}}{\partial a^{(2)}_{2}} + \frac{\mathrm d \Delta}{\mathrm d \Delta_2}\frac{\mathrm d \Delta_2}{\mathrm d \delta^{(3)}_{2}}\frac{\partial \delta^{(3)}_{2}}{\partial a^{(3)}_{2}}\frac{\mathrm d a^{(3)}_{2}}{\mathrm d z^{(3)}_{2}} \frac{\partial z^{(3)}_{2}}{\partial a^{(2)}_{2}} \bigg) \frac{\mathrm d a^{(2)}_{2}}{\mathrm d z^{(2)}_{2}} \frac{\partial z^{(2)}_{2}}{\partial \theta^{(1)}_{12}} = \big(\delta^{(3)}_{1}g'(z^{(3)}_{1})\theta^{(2)}_{21} + \delta^{(3)}_{2}g'(z^{(3)}_{2})\theta^{(2)}_{22}\big)g'(z^{(2)}_{2})a^{(1)}_1 = a^{(1)}_{1}g'(z^{(2)}_{2})\delta^{(2)}_{2}$$ $$\frac{\partial \Delta}{\partial \theta^{(1)}_{21}} = \bigg( \frac{\mathrm d \Delta}{\mathrm d \Delta_1}\frac{\mathrm d \Delta_1}{\mathrm d \delta^{(3)}_{1}}\frac{\partial \delta^{(3)}_{1}}{\partial a^{(3)}_{1}}\frac{\mathrm d a^{(3)}_{1}}{\mathrm d z^{(3)}_{1}} \frac{\partial z^{(3)}_{1}}{\partial a^{(2)}_{1}} + \frac{\mathrm d \Delta}{\mathrm d \Delta_2}\frac{\mathrm d \Delta_2}{\mathrm d \delta^{(3)}_{2}}\frac{\partial \delta^{(3)}_{2}}{\partial a^{(3)}_{2}}\frac{\mathrm d a^{(3)}_{2}}{\mathrm d z^{(3)}_{2}} \frac{\partial z^{(3)}_{2}}{\partial a^{(2)}_{1}} \bigg) \frac{\mathrm d a^{(2)}_{1}}{\mathrm d z^{(2)}_{1}} \frac{\partial z^{(2)}_{1}}{\partial \theta^{(1)}_{21}} = a^{(1)}_{2}g'(z^{(2)}_{1})\delta^{(2)}_{1}$$ $$\frac{\partial \Delta}{\partial \theta^{(1)}_{22}} = \bigg( \frac{\mathrm d \Delta}{\mathrm d \Delta_1}\frac{\mathrm d \Delta_1}{\mathrm d \delta^{(3)}_{1}}\frac{\partial \delta^{(3)}_{1}}{\partial a^{(3)}_{1}}\frac{\mathrm d a^{(3)}_{1}}{\mathrm d z^{(3)}_{1}} \frac{\partial z^{(3)}_{1}}{\partial a^{(2)}_{2}} + \frac{\mathrm d \Delta}{\mathrm d \Delta_2}\frac{\mathrm d \Delta_2}{\mathrm d \delta^{(3)}_{2}}\frac{\partial \delta^{(3)}_{2}}{\partial a^{(3)}_{2}}\frac{\mathrm d a^{(3)}_{2}}{\mathrm d z^{(3)}_{2}} \frac{\partial z^{(3)}_{2}}{\partial a^{(2)}_{2}} \bigg) \frac{\mathrm d a^{(2)}_{2}}{\mathrm d z^{(2)}_{2}} \frac{\partial z^{(2)}_{2}}{\partial \theta^{(1)}_{22}} = a^{(1)}_{2}g'(z^{(2)}_{2})\delta^{(2)}_{2}$$ $$ \delta^{(2)} = \left[ \begin{array}{} \delta^{(2)}_{1}\\ \delta^{(2)}_{2}\\ \end{array}\right] = \left[ \begin{array}{} \theta^{(2)}_{11}\delta^{(3)}_1g'(z^{(3)}_1) + \theta^{(2)}_{12}\delta^{(3)}_2g'(z^{(3)}_2) \\ \theta^{(2)}_{21}\delta^{(3)}_1g'(z^{(3)}_1) + \theta^{(2)}_{22}\delta^{(3)}_2g'(z^{(3)}_2) \\ \end{array}\right] = \left[ \begin{array}{} \theta^{(2)}_{11} & \theta^{(2)}_{12} \\ \theta^{(2)}_{21} & \theta^{(2)}_{22} \\ \end{array}\right] \space \left[ \begin{array}{} \delta^{(3)}_1g'(z^{(3)}_1) \\ \delta^{(3)}_2g'(z^{(3)}_2) \\ \end{array}\right] = (\theta^{(2)})^{T} \delta^{(3)}.*g'(z^{(3)})$$

Generally,

$$\delta^{(\ell)}= \begin{cases} a^{(\ell)} - y & \text{for} \space \ell = L\\ (\theta^{(\ell)})^{T}\delta^{(\ell + 1)}.*g'(z^{(\ell + 1)}) & \text{for} \space 1<\ell \leq L-1 \end{cases} $$

where L = last layer


Summary

$$ z^{(\ell + 1)} = \theta^{(\ell)} \space a^{(\ell)}$$ $$ \frac{\partial J(\theta)}{\partial \theta^{(\ell)}_{ij}} = a^{(\ell)}_{i}g'(z^{(\ell + 1)}_{j})\delta^{(\ell + 1)}_{j} $$ $$ \delta^{(\ell)}= \begin{cases} a^{(\ell)} - y \\ (\theta^{(\ell)})^{T}\delta^{(\ell + 1)}.*g'(z^{(\ell + 1)}) \end{cases} $$ $$ g'(z^{(\ell)}) = a^{(\ell)}\big(1 - a^{(\ell)}\big)$$

The form of weight matrix is given by,

$$\theta^{(\ell)} = \left[ \begin{array}{} \theta^{(\ell)}_{11} & \theta^{(\ell)}_{21} & \cdots & \theta^{(\ell)}_{n1} \\ \theta^{(\ell)}_{12} & \theta^{(\ell)}_{22} & ... & \theta^{(\ell)}_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ \theta^{(\ell)}_{1n} & \cdots & \cdots & \theta^{(\ell)}_{nn} \end{array}\right] $$

References