There are several approaches to calculating backpropagation (BP) for multi-layer perceptrons (MLPs). Our goal is to compute the gradient of the loss with respect to the weight matrix for each layer, which represents a scalar-to-matrix derivative. We can use the following methods:
Calculate $\frac{\partial L}{\partial W^{(l)}}$ directly using matrix-to-matrix gradients. (We won't use this approach because matrix-to-matrix gradients are difficult to compute.)
Calculate $\frac{\partial L}{\partial W^{(l)}}$ directly while avoiding vector-to-matrix gradients. (We won't use this approach either, as it's still quite challenging.)
Calculate $\frac{\partial L}{\partial W_{i,j}^{(l)}}$ for each element and then assemble the results into a matrix. (This is the approach we'll adopt.)
For our chosen method, computing $\frac{\partial L}{\partial W_{i,j}^{(l)}}$ is sufficient to determine all gradients, and you could write a for-loop to complete the update, but this isn't efficient. Modern accelerators such as GPUs can significantly speed up matrix multiplication, so we still need to assemble the scalar results into matrix form.
We'll first examine the gradient calculation for a single example. Since the SGD algorithm requires a batch of examples, we'll then extend our results to handle batches of data.
Preliminaries
To make our calculations more straightforward, we'll first introduce three key concepts you should understand: denominator layout, the multivariable chain rule, and matrix assembly.
Denominator Layout: In the deep learning community, derivatives are written in the denominator layout by default. This means that a scalar-to-matrix gradient has the same shape as the matrix itself; if the result were the transpose of the original matrix, we would be using the numerator layout instead. You can learn more about the denominator layout from this Wikipedia article.
Multivariable Chain Rule: We need to understand the multivariable chain rule. If $x=x(t)$, $y=y(t)$, and $z=f(x,y)$, then $\frac{\partial z}{\partial t}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}$. Here's an article on the multivariable chain rule. Since we calculate gradients in a scalar manner, a function that accepts a vector or matrix becomes a multivariable function. For example, $f(\mathbf{x})$ becomes $f(x_0,x_1,\cdots,x_n)$. Therefore, in the following derivation, we'll always use the multivariable chain rule.
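To make the multivariable chain rule concrete, here's a small numerical check of my own (not part of the derivation): we pick $x(t)=t^2$, $y(t)=\sin t$, and $z=f(x,y)=xy$, then compare the chain-rule value of $\partial z/\partial t$ with a finite-difference estimate.

```python
import numpy as np

def x(t): return t ** 2
def y(t): return np.sin(t)
def f(x_, y_): return x_ * y_          # z = f(x, y)

t = 1.3
# Multivariable chain rule: dz/dt = df/dx * dx/dt + df/dy * dy/dt
dz_dt_chain = y(t) * (2 * t) + x(t) * np.cos(t)

# Finite-difference check
eps = 1e-6
dz_dt_numeric = (f(x(t + eps), y(t + eps)) - f(x(t - eps), y(t - eps))) / (2 * eps)

print(dz_dt_chain, dz_dt_numeric)      # the two values agree to high precision
```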
Matrix Assembly: Finally, we need to understand how to assemble a vector or matrix from its elements. For a matrix $W$, we can write $W=[W_{i,j}]_{i,j}$, where the indices $i$ and $j$ indicate that we iterate over every row and column. Similarly, for a vector we have $\mathbf{v}=[v_i]_i$. Assembly can be thought of as the reverse of an outer product: for two column vectors, $\mathbf{x}\mathbf{y}^\top=[x_i y_j]_{i,j}$. Thus, if we obtain a scalar result $W_{i,j}=x_i y_j$, we can assemble it into the matrix $W=\mathbf{x}\mathbf{y}^\top$.
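As a quick sanity check of the assembly idea (my own illustration, with arbitrary sizes), the sketch below fills a matrix element by element with $W_{i,j}=x_i y_j$ and confirms that it equals the outer product $\mathbf{x}\mathbf{y}^\top$.

```python
import numpy as np

x = np.random.randn(4, 1)   # column vector x
y = np.random.randn(3, 1)   # column vector y

# Element-wise assembly: W[i, j] = x_i * y_j
W = np.empty((4, 3))
for i in range(4):
    for j in range(3):
        W[i, j] = x[i, 0] * y[j, 0]

# The assembled matrix is exactly the outer product x y^T
assert np.allclose(W, x @ y.T)
```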
Definitions
Notation:
Scalars: $x, y, \cdots$
Vectors: $\mathbf{x}, \mathbf{y}, \cdots$
Matrices: $X, Y, \cdots$
Subscript notation: $x_i$ denotes the $i$-th element of the vector $\mathbf{x}$, which is a scalar.
Indicator function: $\mathbb{1}_{ij}$ equals 1 if $i=j$ and 0 otherwise.
Network Architecture:
Input: $\mathbf{x}$ with shape $[n,1]$
Label: $\mathbf{y}$ with shape $[c,1]$ (one-hot encoded)
Number of layers: $L$
Number of classes: $c$
Linear transformation: $W\mathbf{x}+\mathbf{b}$
Weight matrix at layer $l$: $W^{(l)}$
Weight Matrix Shapes:
First layer: $W^{(1)}$ with shape $[m^{(1)}, m^{(0)}]$, where $m^{(0)}=n$
Hidden layers ($l$ from 2 to $L-1$): $W^{(l)}$ with shape $[m^{(l)}, m^{(l-1)}]$
Last layer: $W^{(L)}$ with shape $[c, m^{(L-1)}]$
Activation Function and Activations:
Activation function: $f$
Activation at layer $l$: $\mathbf{a}^{(l)}$
Activation Shapes:
Input: $\mathbf{a}^{(0)}=\mathbf{x}$ with shape $[n,1]$
Hidden layer activations ($l$ from 1 to $L-1$): $\mathbf{a}^{(l)}$ with shape $[m^{(l)},1]$
Output: $\mathbf{a}^{(L)}$ with shape $[c,1]$, where the cross-entropy loss is $L=\mathrm{CE}(\mathbf{a}^{(L)},\mathbf{y})=-\sum_{i=1}^{c}y_i\log\left(a_i^{(L)}\right)$.
Our goal is to calculate the gradient of $L$ with respect to $W^{(l)}$ and $\mathbf{b}^{(l)}$. In this derivation, we'll focus on the gradient with respect to $W^{(l)}$; the gradient with respect to $\mathbf{b}^{(l)}$ follows a similar pattern.
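To make the shapes above concrete, here is a minimal sketch that builds weight matrices with the stated dimensions. The sizes and hidden widths are placeholders chosen purely for illustration.

```python
import numpy as np

n, c, L = 8, 3, 4                 # input size, number of classes, number of layers
m = [n, 16, 16, 16, c]            # m[0] = n, ..., m[L] = c (hidden widths chosen arbitrarily)

# W[l-1] stores W(l) from the definitions, with shape [m(l), m(l-1)]
W = [np.random.randn(m[l], m[l - 1]) * 0.1 for l in range(1, L + 1)]
b = [np.zeros((m[l], 1)) for l in range(1, L + 1)]

for l, W_l in enumerate(W, start=1):
    print(f"W({l}) shape: {W_l.shape}")   # [m(l), m(l-1)]; the last one is [c, m(L-1)]
```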
The Last Layer
Since the last layer differs from the other layers (it uses softmax instead of a regular activation function), we'll calculate its gradient separately.
In the forward pass, $a_i^{(L)}=\frac{e^{z_i^{(L)}}}{\sum_{k=1}^{c}e^{z_k^{(L)}}}$ is the normalized probability output by the softmax layer. A straightforward calculation of the softmax gradient yields $\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}=a_i^{(L)}\left(\mathbb{1}_{ij}-a_j^{(L)}\right)$.
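The softmax gradient formula is easy to verify numerically. The sketch below (my own check, not part of the derivation) compares the analytic Jacobian $a_i(\mathbb{1}_{ij}-a_j)$ with a finite-difference Jacobian of the softmax.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

c = 5
z = np.random.randn(c)
a = softmax(z)

# Analytic Jacobian: J[i, j] = da_i/dz_j = a_i * (1{i==j} - a_j)
J_analytic = np.diag(a) - np.outer(a, a)

# Finite-difference Jacobian, one logit at a time
eps = 1e-6
J_numeric = np.empty((c, c))
for j in range(c):
    dz = np.zeros(c)
    dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

assert np.allclose(J_analytic, J_numeric, atol=1e-6)
```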
To calculate the gradient $\frac{\partial L}{\partial z_j^{(L)}}$, we apply the chain rule through the activations $\mathbf{a}^{(L)}$. An important point: once we work in scalar form, most functions we encounter are multivariable, so we need the multivariable chain rule. For example, $L$ is the result of the cross-entropy function applied to $a_1^{(L)}, a_2^{(L)}, \ldots, a_c^{(L)}$. The multivariable chain rule gives $\frac{\partial L}{\partial z_j^{(L)}}=\sum_{i=1}^{c}\frac{\partial L}{\partial a_i^{(L)}}\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}$. Since $\frac{\partial L}{\partial a_i^{(L)}}=-\frac{y_i}{a_i^{(L)}}$, substituting the softmax gradient yields $\frac{\partial L}{\partial z_j^{(L)}}=\sum_{i=1}^{c}-\frac{y_i}{a_i^{(L)}}a_i^{(L)}\left(\mathbb{1}_{ij}-a_j^{(L)}\right)=-y_j+a_j^{(L)}\sum_{i=1}^{c}y_i=a_j^{(L)}-y_j$, using the fact that $\mathbf{y}$ is one-hot so $\sum_i y_i=1$.
Now we assemble $\frac{\partial L}{\partial z_j^{(L)}}$ into a vector: $\frac{\partial L}{\partial \mathbf{z}^{(L)}}=\mathbf{a}^{(L)}-\mathbf{y}$. The detailed assembly process is $\frac{\partial L}{\partial \mathbf{z}^{(L)}}=\left[\frac{\partial L}{\partial z_j^{(L)}}\right]_{j=1}^{c}=\left[a_j^{(L)}-y_j\right]_{j=1}^{c}=\left[a_j^{(L)}\right]_{j=1}^{c}-\left[y_j\right]_{j=1}^{c}=\mathbf{a}^{(L)}-\mathbf{y}$.
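Here is a small check (random logits and a one-hot label of my own choosing) that the gradient of the cross-entropy loss with respect to the logits really is $\mathbf{a}^{(L)}-\mathbf{y}$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

c = 5
z = np.random.randn(c)
y = np.eye(c)[2]                 # one-hot label for class 2

grad_analytic = softmax(z) - y   # dL/dz = a - y

# Finite-difference gradient of the loss with respect to each logit
eps = 1e-6
grad_numeric = np.array([
    (cross_entropy(z + eps * e_j, y) - cross_entropy(z - eps * e_j, y)) / (2 * eps)
    for e_j in np.eye(c)
])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```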
Gradients for Weights
We can define the error term as:
$$\boldsymbol{\delta}^{(l)}=\frac{\partial L}{\partial \mathbf{z}^{(l)}}$$
which is a column vector that simplifies our chain rule calculations. For the last layer, $\boldsymbol{\delta}^{(L)}=\mathbf{a}^{(L)}-\mathbf{y}$.
If we can calculate $\boldsymbol{\delta}^{(l)}$ for every layer $l$, then we can calculate $\frac{\partial L}{\partial W^{(l)}}$ for every layer. Since we connect $L$ and $W^{(l)}$ through $\mathbf{z}^{(l)}$, we again need the multivariable chain rule: $\frac{\partial L}{\partial W_{i,j}^{(l)}}=\sum_k \frac{\partial L}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial W_{i,j}^{(l)}}$.
Let's recall the matrix multiplication in $\mathbf{z}^{(l)}=W^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$: element-wise, $z_i^{(l)}=\sum_{j=1}^{m^{(l-1)}}W_{i,j}^{(l)}a_j^{(l-1)}+b_i^{(l)}$. From this, we can see that $\frac{\partial z_i^{(l)}}{\partial W_{i,j}^{(l)}}=a_j^{(l-1)}$, and $\frac{\partial z_k^{(l)}}{\partial W_{i,j}^{(l)}}=0$ for $k\neq i$ (that is, $W_{i,j}^{(l)}$ only affects the calculation of $z_i^{(l)}$). Therefore, we have:
$$\frac{\partial L}{\partial W_{i,j}^{(l)}}=\delta_i^{(l)}a_j^{(l-1)},$$
which assembles into $\frac{\partial L}{\partial W^{(l)}}=\boldsymbol{\delta}^{(l)}\mathbf{a}^{(l-1)\top}$.
Next, we propagate the error backward from $\boldsymbol{\delta}^{(l)}$ to $\boldsymbol{\delta}^{(l-1)}$. Applying the multivariable chain rule element-wise,
$$\delta_j^{(l-1)}=\frac{\partial L}{\partial z_j^{(l-1)}}=\sum_k \frac{\partial L}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial z_j^{(l-1)}}=\sum_k \delta_k^{(l)}\frac{\partial z_k^{(l)}}{\partial a_j^{(l-1)}}\frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}}.$$
The second step follows because $z_j^{(l-1)}$ contributes to every $z_k^{(l)}$ through the linear transformation. The third step holds because $z_j^{(l-1)}$ only affects $a_j^{(l-1)}$ through the nonlinear transformation.
Since $a_j^{(l-1)}=f\left(z_j^{(l-1)}\right)$, we have $\frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}}=f'\left(z_j^{(l-1)}\right)$. Since $\mathbf{z}^{(l)}=W^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$, the matrix multiplication gives $\frac{\partial z_k^{(l)}}{\partial a_j^{(l-1)}}=W_{k,j}^{(l)}$. Therefore, we have:
$$\delta_j^{(l-1)}=\sum_k \delta_k^{(l)} W_{k,j}^{(l)} f'\left(z_j^{(l-1)}\right)$$
Now let's assemble the result into vector form. We have:
$$\boldsymbol{\delta}^{(l-1)}=\left(W^{(l)\top}\boldsymbol{\delta}^{(l)}\right)\odot f'\left(\mathbf{z}^{(l-1)}\right),$$
where $\odot$ denotes the element-wise (Hadamard) product.
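Putting the two single-example results together, the sketch below (a toy layer with made-up sizes; ReLU stands in for $f$ purely for illustration) computes the weight gradient $\boldsymbol{\delta}^{(l)}\mathbf{a}^{(l-1)\top}$ and the propagated error $(W^{(l)\top}\boldsymbol{\delta}^{(l)})\odot f'(\mathbf{z}^{(l-1)})$.

```python
import numpy as np

f = lambda z: np.maximum(z, 0)          # ReLU, chosen only for illustration
d_f = lambda z: (z > 0).astype(float)   # its derivative

m_prev, m_cur = 4, 3
z_prev = np.random.randn(m_prev, 1)     # z(l-1), column vector
a_prev = f(z_prev)                      # a(l-1) = f(z(l-1))
W_l = np.random.randn(m_cur, m_prev)    # W(l), shape [m(l), m(l-1)]
delta_l = np.random.randn(m_cur, 1)     # delta(l), assumed already computed

# Weight gradient: dL/dW(l) = delta(l) a(l-1)^T, same shape as W(l)
W_grad = delta_l @ a_prev.T
assert W_grad.shape == W_l.shape

# Error propagation: delta(l-1) = (W(l)^T delta(l)) ⊙ f'(z(l-1))
delta_prev = (W_l.T @ delta_l) * d_f(z_prev)
assert delta_prev.shape == z_prev.shape
```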
We can extend the above calculations to handle batches of data. In our previous discussion, each example was a column vector, but in deep learning programming we typically store examples as rows of a matrix. The matrix $X$ stacks $\mathbf{x}_1^\top,\mathbf{x}_2^\top,\cdots,\mathbf{x}_b^\top$ as rows and has shape $[b,n]$; similarly, $Y$ stacks $\mathbf{y}_1^\top,\mathbf{y}_2^\top,\cdots,\mathbf{y}_b^\top$ as rows and has shape $[b,c]$, where each $\mathbf{x}_i$ and $\mathbf{y}_i$ is a column vector.
The total loss is $L=\frac{1}{b}\sum_{i=1}^{b}L(\mathbf{x}_i,\mathbf{y}_i)$. The vectors $\mathbf{a}^{(l)},\mathbf{z}^{(l)},\boldsymbol{\delta}^{(l)}$ become matrices $A^{(l)},Z^{(l)},\Delta^{(l)}$ respectively, all with shape $[b,m^{(l)}]$. For the linear transformation, we have:
$$Z^{(l)}=A^{(l-1)}W^{(l)\top}+B^{(l)}$$
where $B^{(l)}$ is formed by stacking the bias vector $\mathbf{b}^{(l)\top}$ across the $b$ rows (in practice, broadcasting). For the nonlinear transformation, we have $A^{(l)}=f\left(Z^{(l)}\right)$. (Alternatively, you could define the weight matrix $W$ as the transpose of our original definition to avoid the transpose in the linear transformation.)
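The sketch below (toy shapes of my own choosing) confirms that the batched form $A^{(l-1)}W^{(l)\top}+B^{(l)}$ reproduces the per-example column-vector form $W^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$, with the bias handled by NumPy broadcasting.

```python
import numpy as np

b_size, m_prev, m_cur = 5, 4, 3
A_prev = np.random.randn(b_size, m_prev)   # A(l-1): one example per row
W_l = np.random.randn(m_cur, m_prev)       # W(l), shape [m(l), m(l-1)]
bias = np.random.randn(m_cur)              # b(l)

# Batched linear transformation: Z(l) = A(l-1) W(l)^T + B(l) (bias broadcast over rows)
Z_batch = A_prev @ W_l.T + bias

# Per-example column-vector form: z = W(l) a(l-1) + b(l)
Z_rows = np.stack([W_l @ A_prev[i] + bias for i in range(b_size)])

assert np.allclose(Z_batch, Z_rows)
```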
Note that since our previous discussion for single vectors was done element-wise, the derivation for the matrix case with batches of data follows a similar pattern.
Loss and Softmax
For the loss and softmax, $\frac{\partial L}{\partial Z_{i,j}^{(L)}}=\frac{1}{b}\left(A_{i,j}^{(L)}-Y_{i,j}\right)$. The assembly process is straightforward and leads to $\frac{\partial L}{\partial Z^{(L)}}=\frac{1}{b}\left(A^{(L)}-Y\right)$.
Gradients for Weights
For the gradients of the weights, the transposed form $Z^{(l)\top}=W^{(l)}A^{(l-1)\top}+B^{(l)\top}$ is closer to the vector form above. Since each weight is involved in the calculation for every example, the multivariable chain rule gives:
$$\frac{\partial L}{\partial W_{i,j}^{(l)}}=\sum_{k=1}^{b}\Delta_{k,i}^{(l)}A_{k,j}^{(l-1)},$$
which assembles into
$$\frac{\partial L}{\partial W^{(l)}}=\Delta^{(l)\top}A^{(l-1)}.$$
Let's perform a quick shape check to verify this result. $\frac{\partial L}{\partial W^{(l)}}$ should have the same shape as $W^{(l)}$, which is $[m^{(l)},m^{(l-1)}]$. $\Delta^{(l)\top}$ has shape $[m^{(l)},b]$ and $A^{(l-1)}$ has shape $[b,m^{(l-1)}]$, so $\Delta^{(l)\top}A^{(l-1)}$ has shape $[m^{(l)},m^{(l-1)}]$, which matches $W^{(l)}$.
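The same shape check in code (toy sizes, for illustration only):

```python
import numpy as np

b_size, m_prev, m_cur = 5, 4, 3
Delta_l = np.random.randn(b_size, m_cur)    # Delta(l), shape [b, m(l)]
A_prev = np.random.randn(b_size, m_prev)    # A(l-1), shape [b, m(l-1)]

W_grad = Delta_l.T @ A_prev                 # dL/dW(l) = Delta(l)^T A(l-1)
assert W_grad.shape == (m_cur, m_prev)      # matches W(l)'s shape [m(l), m(l-1)]
```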
Backpropagation Through Layers
Finally, for the update from $\Delta^{(l)}$ to $\Delta^{(l-1)}$, since each example is independent of the others, it's easy to see that $\Delta^{(l-1)\top}=\left(W^{(l)\top}\Delta^{(l)\top}\right)\odot f'\left(Z^{(l-1)\top}\right)$, which means $\Delta^{(l-1)}=\left(\Delta^{(l)}W^{(l)}\right)\odot f'\left(Z^{(l-1)}\right)$. This is a direct extension of the vector form.
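As a consistency check (toy shapes; ReLU stands in for $f$), the batched recursion $\Delta^{(l-1)}=(\Delta^{(l)}W^{(l)})\odot f'(Z^{(l-1)})$ matches the vector formula $(W^{(l)\top}\boldsymbol{\delta}^{(l)})\odot f'(\mathbf{z}^{(l-1)})$ applied row by row.

```python
import numpy as np

d_f = lambda z: (z > 0).astype(float)       # derivative of ReLU, for illustration

b_size, m_prev, m_cur = 5, 4, 3
Z_prev = np.random.randn(b_size, m_prev)    # Z(l-1), shape [b, m(l-1)]
W_l = np.random.randn(m_cur, m_prev)        # W(l), shape [m(l), m(l-1)]
Delta_l = np.random.randn(b_size, m_cur)    # Delta(l), shape [b, m(l)]

# Batched form: Delta(l-1) = (Delta(l) W(l)) ⊙ f'(Z(l-1))
Delta_prev = (Delta_l @ W_l) * d_f(Z_prev)

# Per-example vector form: delta(l-1) = (W(l)^T delta(l)) ⊙ f'(z(l-1))
Delta_rows = np.stack([(W_l.T @ Delta_l[i]) * d_f(Z_prev[i]) for i in range(b_size)])

assert np.allclose(Delta_prev, Delta_rows)
```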
Implementation
Now we have the complete derivation of the backpropagation algorithm:
Forward pass: $Z^{(l)}=A^{(l-1)}W^{(l)\top}+B^{(l)}$, with $A^{(l)}=f\left(Z^{(l)}\right)$ for hidden layers and $A^{(L)}=\mathrm{softmax}\left(Z^{(L)}\right)$ for the last layer.
Last-layer error: $\Delta^{(L)}=\frac{1}{b}\left(A^{(L)}-Y\right)$.
Backward recursion: $\Delta^{(l-1)}=\left(\Delta^{(l)}W^{(l)}\right)\odot f'\left(Z^{(l-1)}\right)$.
Weight gradients: $\frac{\partial L}{\partial W^{(l)}}=\Delta^{(l)\top}A^{(l-1)}$.
Now we can implement the backward pass easily. The pseudocode for a multilayer perceptron without bias is as follows. Note that the W matrix strictly follows the definition above and is consistent with PyTorch's nn.Linear.
```python
# PyTorch-style API implementation.
# W is a list of weight matrices: W[l-1] stores W(l) from the derivation,
# with shape [m(l), m(l-1)], consistent with PyTorch's nn.Linear.
# The activation f, its derivative d_f, and softmax are assumed to be defined elsewhere.
import numpy as np

def forward_pass(X, Y, W):
    L = len(W)
    A = [X]          # A[l] is the activation of layer l; A[0] is the input
    Z = [None]       # Z[0] is unused so that indices line up with the math
    for l in range(1, L):
        Z.append(A[l - 1] @ W[l - 1].T)   # Z(l) = A(l-1) W(l)^T
        A.append(f(Z[l]))                 # hidden layers use the activation f
    Z.append(A[L - 1] @ W[L - 1].T)       # last linear layer
    A.append(softmax(Z[L]))               # last layer uses softmax
    loss = -np.sum(Y * np.log(A[L])) / Y.shape[0]  # mean cross-entropy over the batch
    return loss, A, Z

def backward_pass(Y, A, Z, W):
    L = len(W)
    Delta = [None] * (L + 1)
    W_g = [None] * L
    Delta[L] = (A[L] - Y) / Y.shape[0]    # Delta(L) = (A(L) - Y) / b
    W_g[L - 1] = Delta[L].T @ A[L - 1]    # dL/dW(L) = Delta(L)^T A(L-1)
    for l in range(L - 1, 0, -1):
        Delta[l] = (Delta[l + 1] @ W[l]) * d_f(Z[l])  # Delta(l) = Delta(l+1) W(l+1) ⊙ f'(Z(l))
        W_g[l - 1] = Delta[l].T @ A[l - 1]            # dL/dW(l) = Delta(l)^T A(l-1)
    return W_g

# Given a batch X, one-hot labels Y, and weights W (see the usage sketch below):
loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(Y, A, Z, W)  # equivalent to PyTorch's loss.backward()
```
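To see the implementation in action, here is a usage sketch (random data, ReLU as the activation, all sizes and helper names chosen for illustration) that runs a forward and backward pass with the functions above and checks one analytic weight gradient against a central finite difference.

```python
import numpy as np

# Helpers assumed by the implementation above (ReLU chosen for illustration)
def f(z):
    return np.maximum(z, 0)

def d_f(z):
    return (z > 0).astype(z.dtype)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
b, n, c = 6, 5, 3                          # batch size, input size, number of classes
sizes = [n, 8, 8, c]                       # m(0)=n, hidden widths, m(L)=c
W = [rng.normal(0, 0.5, (sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
X = rng.normal(size=(b, n))
Y = np.eye(c)[rng.integers(0, c, size=b)]  # one-hot labels

loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(Y, A, Z, W)

# Central finite-difference check of dL/dW(1)[0, 0]
eps = 1e-6
W_plus = [w.copy() for w in W]; W_plus[0][0, 0] += eps
W_minus = [w.copy() for w in W]; W_minus[0][0, 0] -= eps
numeric = (forward_pass(X, Y, W_plus)[0] - forward_pass(X, Y, W_minus)[0]) / (2 * eps)
print(W_g[0][0, 0], numeric)               # the two values should agree closely
```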