There are several approaches to calculating backpropagation (BP) for multi-layer perceptrons (MLPs). Our goal is to compute the gradient of the loss with respect to the weight matrix for each layer, which represents a scalar-to-matrix derivative. We can use the following methods:
Calculate $\frac{\partial L}{\partial W^{(l)}}$ directly using matrix-to-matrix gradients. (We won't use this approach because matrix-to-matrix gradients are difficult to compute.)
Calculate $\frac{\partial L}{\partial W^{(l)}}$ directly while avoiding vector-to-matrix gradients. (We won't use this approach either, as it's still quite challenging.)
Calculate $\frac{\partial L}{\partial W_{i,j}^{(l)}}$ for each element and then assemble the results into a matrix. (This is the approach we'll adopt.)
For our chosen method, computing $\frac{\partial L}{\partial W_{i,j}^{(l)}}$ is sufficient to determine all gradients, and you could write a for-loop to complete the update, but this isn't efficient. Modern accelerators such as GPUs can significantly speed up matrix multiplication, so we still need to assemble the scalar results into matrix form.
We'll first examine the gradient calculation for a single example. Since the SGD algorithm requires a batch of examples, we'll then extend our results to handle batches of data.
Preliminaries
To make our calculations more straightforward, we'll first introduce three key concepts you should understand: denominator layout, the multivariable chain rule, and matrix assembly.
Denominator Layout: In the deep learning community, derivatives are written in the denominator layout by default. This means that a scalar-to-matrix gradient has the same shape as the matrix itself; if the result were the transpose of the original matrix, we would be using the numerator layout instead. You can learn more about the denominator layout from this Wikipedia article.
Multivariable Chain Rule: We need to understand the multivariable chain rule. If $x=x(t)$, $y=y(t)$, and $z=f(x,y)$, then $\frac{\partial z}{\partial t}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}$. Here's an article on the multivariable chain rule. Since we calculate gradients in a scalar manner, a function that accepts a vector or matrix becomes a multivariable function. For example, $f(\mathbf{x})$ becomes $f(x_0,x_1,\cdots,x_n)$. Therefore, in the following derivation, we'll always use the multivariable chain rule.
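To make the multivariable chain rule concrete, here's a small numerical check of my own (not part of the derivation): we pick $x(t)=t^2$, $y(t)=\sin t$, and $z=f(x,y)=xy$, then compare the chain-rule value of $\partial z/\partial t$ with a finite-difference estimate.

```python
import numpy as np

def x(t): return t ** 2
def y(t): return np.sin(t)
def f(x_, y_): return x_ * y_          # z = f(x, y)

t = 1.3
# Multivariable chain rule: dz/dt = df/dx * dx/dt + df/dy * dy/dt
dz_dt_chain = y(t) * (2 * t) + x(t) * np.cos(t)

# Finite-difference check
eps = 1e-6
dz_dt_numeric = (f(x(t + eps), y(t + eps)) - f(x(t - eps), y(t - eps))) / (2 * eps)

print(dz_dt_chain, dz_dt_numeric)      # the two values agree to high precision
```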
Matrix Assembly: Finally, we need to understand how to assemble a vector or matrix from its elements. For a matrix $W$, we can write $W=[W_{i,j}]_{i,j}$, where the indices $i$ and $j$ indicate that we iterate over every row and column. Similarly, for a vector we have $\mathbf{v}=[v_i]_i$. Assembly can be thought of as the reverse of an outer product: for two column vectors, $\mathbf{x}\mathbf{y}^\top=[x_i y_j]_{i,j}$. Thus, if we obtain a scalar result $W_{i,j}=x_i y_j$, we can assemble it into the matrix $W=\mathbf{x}\mathbf{y}^\top$.
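As a quick sanity check of the assembly idea (my own illustration, with arbitrary sizes), the sketch below fills a matrix element by element with $W_{i,j}=x_i y_j$ and confirms that it equals the outer product $\mathbf{x}\mathbf{y}^\top$.

```python
import numpy as np

x = np.random.randn(4, 1)   # column vector x
y = np.random.randn(3, 1)   # column vector y

# Element-wise assembly: W[i, j] = x_i * y_j
W = np.empty((4, 3))
for i in range(4):
    for j in range(3):
        W[i, j] = x[i, 0] * y[j, 0]

# The assembled matrix is exactly the outer product x y^T
assert np.allclose(W, x @ y.T)
```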
Definitions
Notation:
Scalars: $x, y, \cdots$
Vectors: $\mathbf{x}, \mathbf{y}, \cdots$
Matrices: $X, Y, \cdots$
Subscript notation: $x_i$ denotes the $i$-th element of the vector $\mathbf{x}$, which is a scalar.
Indicator function: $\mathbb{1}_{ij}$ equals 1 if $i=j$ and 0 otherwise.
Network Architecture:
Input: $\mathbf{x}$ with shape $[n,1]$
Label: $\mathbf{y}$ with shape $[c,1]$ (one-hot encoded)
Number of layers: $L$
Number of classes: $c$
Linear transformation: $W\mathbf{x}+\mathbf{b}$
Weight matrix at layer $l$: $W^{(l)}$
Weight Matrix Shapes:
First layer: $W^{(1)}$ with shape $[m^{(1)}, m^{(0)}]$, where $m^{(0)}=n$
Hidden layers ($l$ from 2 to $L-1$): $W^{(l)}$ with shape $[m^{(l)}, m^{(l-1)}]$
Last layer: $W^{(L)}$ with shape $[c, m^{(L-1)}]$
Activation Function and Activations:
Activation function: $f$
Activation at layer $l$: $\mathbf{a}^{(l)}$
Activation Shapes:
Input: $\mathbf{a}^{(0)}=\mathbf{x}$ with shape $[n,1]$
Hidden layer activations ($l$ from 1 to $L-1$): $\mathbf{a}^{(l)}$ with shape $[m^{(l)},1]$
Output: $\mathbf{a}^{(L)}$ with shape $[c,1]$, where the cross-entropy loss is $L=\mathrm{CE}(\mathbf{a}^{(L)},\mathbf{y})=-\sum_{i=1}^{c}y_i\log\left(a_i^{(L)}\right)$.
Our goal is to calculate the gradient of $L$ with respect to $W^{(l)}$ and $\mathbf{b}^{(l)}$. In this derivation, we'll focus on the gradient with respect to $W^{(l)}$; the gradient with respect to $\mathbf{b}^{(l)}$ follows a similar pattern.
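To make the shapes above concrete, here is a minimal sketch that builds weight matrices with the stated dimensions. The sizes and hidden widths are placeholders chosen purely for illustration.

```python
import numpy as np

n, c, L = 8, 3, 4                 # input size, number of classes, number of layers
m = [n, 16, 16, 16, c]            # m[0] = n, ..., m[L] = c (hidden widths chosen arbitrarily)

# W[l-1] stores W(l) from the definitions, with shape [m(l), m(l-1)]
W = [np.random.randn(m[l], m[l - 1]) * 0.1 for l in range(1, L + 1)]
b = [np.zeros((m[l], 1)) for l in range(1, L + 1)]

for l, W_l in enumerate(W, start=1):
    print(f"W({l}) shape: {W_l.shape}")   # [m(l), m(l-1)]; the last one is [c, m(L-1)]
```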
The Last Layer
Since the last layer differs from the other layers (it uses softmax instead of a regular activation function), we'll calculate its gradient separately.
In the forward pass, $a_i^{(L)}=\frac{e^{z_i^{(L)}}}{\sum_{k=1}^{c}e^{z_k^{(L)}}}$ is the normalized probability output by the softmax layer. A straightforward calculation of the softmax gradient yields $\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}=a_i^{(L)}\left(\mathbb{1}_{ij}-a_j^{(L)}\right)$.
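The softmax gradient formula is easy to verify numerically. The sketch below (my own check, not part of the derivation) compares the analytic Jacobian $a_i(\mathbb{1}_{ij}-a_j)$ with a finite-difference Jacobian of the softmax.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

c = 5
z = np.random.randn(c)
a = softmax(z)

# Analytic Jacobian: J[i, j] = da_i/dz_j = a_i * (1{i==j} - a_j)
J_analytic = np.diag(a) - np.outer(a, a)

# Finite-difference Jacobian, one logit at a time
eps = 1e-6
J_numeric = np.empty((c, c))
for j in range(c):
    dz = np.zeros(c)
    dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

assert np.allclose(J_analytic, J_numeric, atol=1e-6)
```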
To calculate the gradient $\frac{\partial L}{\partial z_j^{(L)}}$, we apply the chain rule through the activations $\mathbf{a}^{(L)}$. An important point: once we work in scalar form, most functions we encounter are multivariable, so we need the multivariable chain rule. For example, $L$ is the result of the cross-entropy function applied to $a_1^{(L)}, a_2^{(L)}, \ldots, a_c^{(L)}$. The multivariable chain rule gives $\frac{\partial L}{\partial z_j^{(L)}}=\sum_{i=1}^{c}\frac{\partial L}{\partial a_i^{(L)}}\frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}$. Since $\frac{\partial L}{\partial a_i^{(L)}}=-\frac{y_i}{a_i^{(L)}}$, substituting the softmax gradient yields $\frac{\partial L}{\partial z_j^{(L)}}=\sum_{i=1}^{c}-\frac{y_i}{a_i^{(L)}}a_i^{(L)}\left(\mathbb{1}_{ij}-a_j^{(L)}\right)=-y_j+a_j^{(L)}\sum_{i=1}^{c}y_i=a_j^{(L)}-y_j$, using the fact that $\mathbf{y}$ is one-hot so $\sum_i y_i=1$.
Now we assemble $\frac{\partial L}{\partial z_j^{(L)}}$ into a vector: $\frac{\partial L}{\partial \mathbf{z}^{(L)}}=\mathbf{a}^{(L)}-\mathbf{y}$. The detailed assembly process is $\frac{\partial L}{\partial \mathbf{z}^{(L)}}=\left[\frac{\partial L}{\partial z_j^{(L)}}\right]_{j=1}^{c}=\left[a_j^{(L)}-y_j\right]_{j=1}^{c}=\left[a_j^{(L)}\right]_{j=1}^{c}-\left[y_j\right]_{j=1}^{c}=\mathbf{a}^{(L)}-\mathbf{y}$.
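Here is a small check (random logits and a one-hot label of my own choosing) that the gradient of the cross-entropy loss with respect to the logits really is $\mathbf{a}^{(L)}-\mathbf{y}$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

c = 5
z = np.random.randn(c)
y = np.eye(c)[2]                 # one-hot label for class 2

grad_analytic = softmax(z) - y   # dL/dz = a - y

# Finite-difference gradient of the loss with respect to each logit
eps = 1e-6
grad_numeric = np.array([
    (cross_entropy(z + eps * e_j, y) - cross_entropy(z - eps * e_j, y)) / (2 * eps)
    for e_j in np.eye(c)
])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```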
Gradients for Weights
We can define the error term as:
$$\boldsymbol{\delta}^{(l)}=\frac{\partial L}{\partial \mathbf{z}^{(l)}}$$
which is a column vector that simplifies our chain rule calculations. For the last layer, $\boldsymbol{\delta}^{(L)}=\mathbf{a}^{(L)}-\mathbf{y}$.
If we can calculate $\boldsymbol{\delta}^{(l)}$ for every layer $l$, then we can calculate $\frac{\partial L}{\partial W^{(l)}}$ for every layer. Since we connect $L$ and $W^{(l)}$ through $\mathbf{z}^{(l)}$, we again need the multivariable chain rule: $\frac{\partial L}{\partial W_{i,j}^{(l)}}=\sum_k \frac{\partial L}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial W_{i,j}^{(l)}}$.
Let's recall the matrix multiplication in $\mathbf{z}^{(l)}=W^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$: element-wise, $z_i^{(l)}=\sum_{j=1}^{m^{(l-1)}}W_{i,j}^{(l)}a_j^{(l-1)}+b_i^{(l)}$. From this, we can see that $\frac{\partial z_i^{(l)}}{\partial W_{i,j}^{(l)}}=a_j^{(l-1)}$, and $\frac{\partial z_k^{(l)}}{\partial W_{i,j}^{(l)}}=0$ for $k\neq i$ (that is, $W_{i,j}^{(l)}$ only affects the calculation of $z_i^{(l)}$). Therefore, we have:
$$\frac{\partial L}{\partial W_{i,j}^{(l)}}=\delta_i^{(l)}a_j^{(l-1)},$$
which assembles into $\frac{\partial L}{\partial W^{(l)}}=\boldsymbol{\delta}^{(l)}\mathbf{a}^{(l-1)\top}$.
Next, we propagate the error backward from $\boldsymbol{\delta}^{(l)}$ to $\boldsymbol{\delta}^{(l-1)}$. Applying the multivariable chain rule element-wise,
$$\delta_j^{(l-1)}=\frac{\partial L}{\partial z_j^{(l-1)}}=\sum_k \frac{\partial L}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial z_j^{(l-1)}}=\sum_k \delta_k^{(l)}\frac{\partial z_k^{(l)}}{\partial a_j^{(l-1)}}\frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}}.$$
The second step follows because $z_j^{(l-1)}$ contributes to every $z_k^{(l)}$ through the linear transformation. The third step holds because $z_j^{(l-1)}$ only affects $a_j^{(l-1)}$ through the nonlinear transformation.
Since $a_j^{(l-1)}=f\left(z_j^{(l-1)}\right)$, we have $\frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}}=f'\left(z_j^{(l-1)}\right)$. Since $\mathbf{z}^{(l)}=W^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$, the matrix multiplication gives $\frac{\partial z_k^{(l)}}{\partial a_j^{(l-1)}}=W_{k,j}^{(l)}$. Therefore, we have:
$$\delta_j^{(l-1)}=\sum_k \delta_k^{(l)} W_{k,j}^{(l)} f'\left(z_j^{(l-1)}\right)$$
Now let's assemble the result into vector form. We have:
$$\boldsymbol{\delta}^{(l-1)}=\left(W^{(l)\top}\boldsymbol{\delta}^{(l)}\right)\odot f'\left(\mathbf{z}^{(l-1)}\right),$$
where $\odot$ denotes the element-wise (Hadamard) product.
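Putting the two single-example results together, the sketch below (a toy layer with made-up sizes; ReLU stands in for $f$ purely for illustration) computes the weight gradient $\boldsymbol{\delta}^{(l)}\mathbf{a}^{(l-1)\top}$ and the propagated error $(W^{(l)\top}\boldsymbol{\delta}^{(l)})\odot f'(\mathbf{z}^{(l-1)})$.

```python
import numpy as np

f = lambda z: np.maximum(z, 0)          # ReLU, chosen only for illustration
d_f = lambda z: (z > 0).astype(float)   # its derivative

m_prev, m_cur = 4, 3
z_prev = np.random.randn(m_prev, 1)     # z(l-1), column vector
a_prev = f(z_prev)                      # a(l-1) = f(z(l-1))
W_l = np.random.randn(m_cur, m_prev)    # W(l), shape [m(l), m(l-1)]
delta_l = np.random.randn(m_cur, 1)     # delta(l), assumed already computed

# Weight gradient: dL/dW(l) = delta(l) a(l-1)^T, same shape as W(l)
W_grad = delta_l @ a_prev.T
assert W_grad.shape == W_l.shape

# Error propagation: delta(l-1) = (W(l)^T delta(l)) ⊙ f'(z(l-1))
delta_prev = (W_l.T @ delta_l) * d_f(z_prev)
assert delta_prev.shape == z_prev.shape
```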
We can extend the above calculations to handle batches of data. In our previous discussion, each example was a column vector, but in deep learning programming we typically store examples as rows of a matrix. The matrix $X$ stacks $\mathbf{x}_1^\top,\mathbf{x}_2^\top,\cdots,\mathbf{x}_b^\top$ as rows and has shape $[b,n]$; similarly, $Y$ stacks $\mathbf{y}_1^\top,\mathbf{y}_2^\top,\cdots,\mathbf{y}_b^\top$ as rows and has shape $[b,c]$, where each $\mathbf{x}_i$ and $\mathbf{y}_i$ is a column vector.
The total loss is $L=\frac{1}{b}\sum_{i=1}^{b}L(\mathbf{x}_i,\mathbf{y}_i)$. The vectors $\mathbf{a}^{(l)},\mathbf{z}^{(l)},\boldsymbol{\delta}^{(l)}$ become matrices $A^{(l)},Z^{(l)},\Delta^{(l)}$ respectively, all with shape $[b,m^{(l)}]$. For the linear transformation, we have:
$$Z^{(l)}=A^{(l-1)}W^{(l)\top}+B^{(l)}$$
where $B^{(l)}$ is formed by stacking the bias vector $\mathbf{b}^{(l)\top}$ across the $b$ rows (in practice, broadcasting). For the nonlinear transformation, we have $A^{(l)}=f\left(Z^{(l)}\right)$. (Alternatively, you could define the weight matrix $W$ as the transpose of our original definition to avoid the transpose in the linear transformation.)
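The sketch below (toy shapes of my own choosing) confirms that the batched form $A^{(l-1)}W^{(l)\top}+B^{(l)}$ reproduces the per-example column-vector form $W^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)}$, with the bias handled by NumPy broadcasting.

```python
import numpy as np

b_size, m_prev, m_cur = 5, 4, 3
A_prev = np.random.randn(b_size, m_prev)   # A(l-1): one example per row
W_l = np.random.randn(m_cur, m_prev)       # W(l), shape [m(l), m(l-1)]
bias = np.random.randn(m_cur)              # b(l)

# Batched linear transformation: Z(l) = A(l-1) W(l)^T + B(l) (bias broadcast over rows)
Z_batch = A_prev @ W_l.T + bias

# Per-example column-vector form: z = W(l) a(l-1) + b(l)
Z_rows = np.stack([W_l @ A_prev[i] + bias for i in range(b_size)])

assert np.allclose(Z_batch, Z_rows)
```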
Note that since our previous discussion for single vectors was done element-wise, the derivation for the matrix case with batches of data follows a similar pattern.
Loss and Softmax
For the loss and softmax, $\frac{\partial L}{\partial Z_{i,j}^{(L)}}=\frac{1}{b}\left(A_{i,j}^{(L)}-Y_{i,j}\right)$. The assembly process is straightforward and leads to $\frac{\partial L}{\partial Z^{(L)}}=\frac{1}{b}\left(A^{(L)}-Y\right)$.
Gradients for Weights
For the gradients of the weights, the transposed form $Z^{(l)\top}=W^{(l)}A^{(l-1)\top}+B^{(l)\top}$ is closer to the vector form above. Since each weight is involved in the calculation for every example, the multivariable chain rule gives:
$$\frac{\partial L}{\partial W_{i,j}^{(l)}}=\sum_{k=1}^{b}\Delta_{k,i}^{(l)}A_{k,j}^{(l-1)},$$
which assembles into
$$\frac{\partial L}{\partial W^{(l)}}=\Delta^{(l)\top}A^{(l-1)}.$$
Let's perform a quick shape check to verify this result. $\frac{\partial L}{\partial W^{(l)}}$ should have the same shape as $W^{(l)}$, which is $[m^{(l)},m^{(l-1)}]$. $\Delta^{(l)\top}$ has shape $[m^{(l)},b]$ and $A^{(l-1)}$ has shape $[b,m^{(l-1)}]$, so $\Delta^{(l)\top}A^{(l-1)}$ has shape $[m^{(l)},m^{(l-1)}]$, which matches $W^{(l)}$.
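The same shape check in code (toy sizes, for illustration only):

```python
import numpy as np

b_size, m_prev, m_cur = 5, 4, 3
Delta_l = np.random.randn(b_size, m_cur)    # Delta(l), shape [b, m(l)]
A_prev = np.random.randn(b_size, m_prev)    # A(l-1), shape [b, m(l-1)]

W_grad = Delta_l.T @ A_prev                 # dL/dW(l) = Delta(l)^T A(l-1)
assert W_grad.shape == (m_cur, m_prev)      # matches W(l)'s shape [m(l), m(l-1)]
```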
Backpropagation Through Layers
Finally, for the update from $\Delta^{(l)}$ to $\Delta^{(l-1)}$, since each example is independent of the others, it's easy to see that $\Delta^{(l-1)\top}=\left(W^{(l)\top}\Delta^{(l)\top}\right)\odot f'\left(Z^{(l-1)\top}\right)$, which means $\Delta^{(l-1)}=\left(\Delta^{(l)}W^{(l)}\right)\odot f'\left(Z^{(l-1)}\right)$. This is a direct extension of the vector form.
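As a consistency check (toy shapes; ReLU stands in for $f$), the batched recursion $\Delta^{(l-1)}=(\Delta^{(l)}W^{(l)})\odot f'(Z^{(l-1)})$ matches the vector formula $(W^{(l)\top}\boldsymbol{\delta}^{(l)})\odot f'(\mathbf{z}^{(l-1)})$ applied row by row.

```python
import numpy as np

d_f = lambda z: (z > 0).astype(float)       # derivative of ReLU, for illustration

b_size, m_prev, m_cur = 5, 4, 3
Z_prev = np.random.randn(b_size, m_prev)    # Z(l-1), shape [b, m(l-1)]
W_l = np.random.randn(m_cur, m_prev)        # W(l), shape [m(l), m(l-1)]
Delta_l = np.random.randn(b_size, m_cur)    # Delta(l), shape [b, m(l)]

# Batched form: Delta(l-1) = (Delta(l) W(l)) ⊙ f'(Z(l-1))
Delta_prev = (Delta_l @ W_l) * d_f(Z_prev)

# Per-example vector form: delta(l-1) = (W(l)^T delta(l)) ⊙ f'(z(l-1))
Delta_rows = np.stack([(W_l.T @ Delta_l[i]) * d_f(Z_prev[i]) for i in range(b_size)])

assert np.allclose(Delta_prev, Delta_rows)
```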
Implementation
Now we have the complete derivation of the backpropagation algorithm:
Forward pass: $Z^{(l)}=A^{(l-1)}W^{(l)\top}+B^{(l)}$, with $A^{(l)}=f\left(Z^{(l)}\right)$ for hidden layers and $A^{(L)}=\mathrm{softmax}\left(Z^{(L)}\right)$ for the last layer.
Last-layer error: $\Delta^{(L)}=\frac{1}{b}\left(A^{(L)}-Y\right)$.
Backward recursion: $\Delta^{(l-1)}=\left(\Delta^{(l)}W^{(l)}\right)\odot f'\left(Z^{(l-1)}\right)$.
Weight gradients: $\frac{\partial L}{\partial W^{(l)}}=\Delta^{(l)\top}A^{(l-1)}$.
Now we can implement the backward pass easily. The pseudocode for a multilayer perceptron without bias is as follows. Note that the W matrix strictly follows the definition above and is consistent with PyTorch's nn.Linear.
```python
# PyTorch-style API implementation.
# W is a list of weight matrices: W[l-1] stores W(l) from the derivation,
# with shape [m(l), m(l-1)], consistent with PyTorch's nn.Linear.
# The activation f, its derivative d_f, and softmax are assumed to be defined elsewhere.
import numpy as np

def forward_pass(X, Y, W):
    L = len(W)
    A = [X]          # A[l] is the activation of layer l; A[0] is the input
    Z = [None]       # Z[0] is unused so that indices line up with the math
    for l in range(1, L):
        Z.append(A[l - 1] @ W[l - 1].T)   # Z(l) = A(l-1) W(l)^T
        A.append(f(Z[l]))                 # hidden layers use the activation f
    Z.append(A[L - 1] @ W[L - 1].T)       # last linear layer
    A.append(softmax(Z[L]))               # last layer uses softmax
    loss = -np.sum(Y * np.log(A[L])) / Y.shape[0]  # mean cross-entropy over the batch
    return loss, A, Z

def backward_pass(Y, A, Z, W):
    L = len(W)
    Delta = [None] * (L + 1)
    W_g = [None] * L
    Delta[L] = (A[L] - Y) / Y.shape[0]    # Delta(L) = (A(L) - Y) / b
    W_g[L - 1] = Delta[L].T @ A[L - 1]    # dL/dW(L) = Delta(L)^T A(L-1)
    for l in range(L - 1, 0, -1):
        Delta[l] = (Delta[l + 1] @ W[l]) * d_f(Z[l])  # Delta(l) = Delta(l+1) W(l+1) ⊙ f'(Z(l))
        W_g[l - 1] = Delta[l].T @ A[l - 1]            # dL/dW(l) = Delta(l)^T A(l-1)
    return W_g

# Given a batch X, one-hot labels Y, and weights W (see the usage sketch below):
loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(Y, A, Z, W)  # equivalent to PyTorch's loss.backward()
```
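To see the implementation in action, here is a usage sketch (random data, ReLU as the activation, all sizes and helper names chosen for illustration) that runs a forward and backward pass with the functions above and checks one analytic weight gradient against a central finite difference.

```python
import numpy as np

# Helpers assumed by the implementation above (ReLU chosen for illustration)
def f(z):
    return np.maximum(z, 0)

def d_f(z):
    return (z > 0).astype(z.dtype)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
b, n, c = 6, 5, 3                          # batch size, input size, number of classes
sizes = [n, 8, 8, c]                       # m(0)=n, hidden widths, m(L)=c
W = [rng.normal(0, 0.5, (sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
X = rng.normal(size=(b, n))
Y = np.eye(c)[rng.integers(0, c, size=b)]  # one-hot labels

loss, A, Z = forward_pass(X, Y, W)
W_g = backward_pass(Y, A, Z, W)

# Central finite-difference check of dL/dW(1)[0, 0]
eps = 1e-6
W_plus = [w.copy() for w in W]; W_plus[0][0, 0] += eps
W_minus = [w.copy() for w in W]; W_minus[0][0, 0] -= eps
numeric = (forward_pass(X, Y, W_plus)[0] - forward_pass(X, Y, W_minus)[0]) / (2 * eps)
print(W_g[0][0, 0], numeric)               # the two values should agree closely
```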