Two-layer MLP backpropagation using differential forms - Matrix Gradients by hand in Deep Learning (1)
TL;DR This article is a deep dive into computing the gradient of a scalar function defined on a matrix space. We will see how to define a differential form, how to use it to calculate matrix gradients, how the chain rule carries over to this setting, and finally how to derive by hand the gradients of a two-layer neural network.
We suppose that all functions are differentiable.
Introduction: Jacobian and gradient for vector functions
The following remark can be skipped if the reader is not familiar with differential geometry or the concept of a differential form.
From a higher point of view, the gradient, in finite dimensions or more generally in Hilbert spaces, is the vector representation of a linear form in the dual of the tangent space (see the Riesz representation theorem).
Consequently, the gradient always has the same shape as the domain of the function.
Consider a function $f:\mathbb{R}^n\to\mathbb{R}^m$; the Jacobian matrix of $f$ is the matrix of all first-order partial derivatives of $f$.
$$
J(f) = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{pmatrix}
$$
Consider a scalar function $g:\mathbb{R}^n\to\mathbb{R}$; the gradient of $g$ is the transpose of the Jacobian matrix of $g$.
$$
\nabla g = J(g)^\top = \begin{bmatrix} \frac{\partial g}{\partial x_1} & \cdots & \frac{\partial g}{\partial x_n} \end{bmatrix}^\top = \begin{pmatrix} \frac{\partial g}{\partial x_1} \\ \vdots \\ \frac{\partial g}{\partial x_n} \end{pmatrix}
$$
The famous chain rule of differentiation can be expressed in terms of the Jacobian matrix.
Let $f:\mathbb{R}^n\to\mathbb{R}^m$ and $g:\mathbb{R}^m\to\mathbb{R}^p$. The Jacobian matrix of the composition $g\circ f$ is the product of the Jacobian matrix of $g$ (evaluated at $f(x)$) and the Jacobian matrix of $f$ (evaluated at $x$):
$$
J(g\circ f)(x) = J(g)(f(x)) \cdot J(f)(x)
$$
Thus we have to be careful with the order of the Jacobian matrices when applying the chain rule with the gradient.
Let $f:\mathbb{R}^n\to\mathbb{R}^m$ and $g:\mathbb{R}^m\to\mathbb{R}$. The gradient of the composition $g\circ f$ is obtained by transposing the chain rule above:
$$
\nabla(g\circ f)(x) = \big(J(g)(f(x)) \cdot J(f)(x)\big)^\top = J(f)(x)^\top \cdot J(g)(f(x))^\top = J(f)(x)^\top\, \nabla g(f(x))
$$
Be careful with the shapes: the Jacobian matrix of $g$ is $1\times m$ and the Jacobian matrix of $f$ is $m\times n$, whereas the gradient of $g$ is $m\times 1$ and the gradient of $g\circ f$ is $n\times 1$.
The most tedious part of the chain rule is keeping track of the shapes of the Jacobian matrices and the gradients, especially when the function is defined on a matrix or tensor space.
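To make this shape bookkeeping concrete, here is a minimal PyTorch sketch (the functions `f` and `g` below are arbitrary toy examples, not from the original article) checking that the gradient of a composition matches the product of Jacobians:

```python
import torch
from torch.autograd.functional import jacobian

# Toy example: f: R^3 -> R^2 and g: R^2 -> R, chosen arbitrarily.
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2]])

def g(u):
    return (u ** 2).sum()

x = torch.randn(3)
Jf = jacobian(f, x)                        # J(f)(x), shape (2, 3)
Jg = jacobian(g, f(x))                     # J(g)(f(x)) as a vector, shape (2,)
grad_gf = jacobian(lambda t: g(f(t)), x)   # gradient of g∘f, shape (3,)
print(torch.allclose(grad_gf, Jg @ Jf))    # chain rule: J(g∘f) = J(g)·J(f)
```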
The Jacobian becomes somewhat pathological when the function is defined on a matrix space: the Jacobian of a matrix-to-matrix map is a fourth-order tensor, whereas the gradient always has the same shape as the domain of the function (it is the representation of a linear form in the dual space).
Next we will see how to compute the gradient of a function defined on a matrix space.
General differential forms
Scalar function of a matrix variable
Suppose we have a function $L$ that takes a matrix $X$ as input and outputs a scalar.
- $L$ is a scalar function of a matrix variable $X$, for example of dimension $(m\times n)$.
We want to define and compute
$$
\frac{\partial L}{\partial X} \quad \text{or} \quad \nabla_X L,
$$
where $\frac{\partial L}{\partial X}$ has the same shape as $X$ ($m\times n$), and whose element at position $(i,j)$ is $\frac{\partial L}{\partial X_{i,j}}$:
$$
\left[\frac{\partial L}{\partial X}\right]_{i,j} = \frac{\partial L}{\partial X_{i,j}}.
$$
We now introduce the concept of differentials. If $L$ is differentiable, then for a small change $dX$ in the matrix $X$, we have
$$
dL = \sum_{i=1}^m \sum_{j=1}^n \frac{\partial L}{\partial X_{i,j}}\, dX_{i,j}
$$
Or simply with Einstein notation:
$$
dL = \frac{\partial L}{\partial X_{i,j}}\, dX_{i,j}.
$$
Tip: If the reader is familiar with the dual space, they will recognize that this notion of a small change is actually a differential form.
Important (Frobenius product): We will use the following formula many times in the next sections.
We can write dL as a Frobenius product of the gradient of L and the differential of X.
$$
dL = \operatorname{tr}\left(\left(\frac{\partial L}{\partial X}\right)^\top dX\right)
$$
This follows from the identity $\operatorname{tr}(A^\top B) = \sum_{i,j} A_{i,j} B_{i,j}$.
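As a quick numerical illustration, here is a minimal sketch (with an arbitrary test function $L(X) = \sum_{i,j} X_{i,j}^3$, chosen only for this demo) checking that a small change $dX$ produces $dL \approx \operatorname{tr}\big((\partial L/\partial X)^\top dX\big)$:

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3, requires_grad=True)
L = (X ** 3).sum()
L.backward()                                  # X.grad holds dL/dX, shape (4, 3)

dX = 1e-6 * torch.randn(4, 3)                 # a small perturbation of X
dL_numeric = ((X.detach() + dX) ** 3).sum() - (X.detach() ** 3).sum()
dL_frobenius = torch.trace(X.grad.T @ dX)     # tr((dL/dX)^T dX)
print(dL_numeric.item(), dL_frobenius.item()) # agree to first order
```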
Tensor function of a tensor variable
Suppose we have a function that takes a matrix $X\in\mathbb{R}^{p\times q}$ as input and outputs a matrix $Y\in\mathbb{R}^{m\times n}$.
We want to define and compute
$$
\frac{\partial Y}{\partial X} \quad \text{or} \quad \nabla_X Y,
$$
where $\frac{\partial Y}{\partial X}$ is a tensor of shape $(m\times n\times p\times q)$, whose element at position $(i,j,k,\ell)$ is $\frac{\partial Y_{i,j}}{\partial X_{k,\ell}}$:
$$
\left[\frac{\partial Y}{\partial X}\right]_{i,j,k,\ell} = \frac{\partial Y_{i,j}}{\partial X_{k,\ell}}
$$
The general form of the chain rule for tensor functions
Let $X$ be a matrix of shape $(m\times n)$, $Y$ a matrix of shape $(p\times q)$ depending on $X$, and $Z$ a scalar function of $Y$. Then the chain rule reads (with Einstein summation over the repeated indices $k,\ell$):
$$
\frac{\partial Z}{\partial X_{i,j}} = \frac{\partial Z}{\partial Y_{k,\ell}} \cdot \frac{\partial Y_{k,\ell}}{\partial X_{i,j}}
$$
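The sketch below (an arbitrary linear map $Y = AX$, chosen for illustration, with the chain-rule contraction written as an einsum) shows both the fourth-order shape of $\partial Y/\partial X$ and the chain rule above:

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
X = torch.randn(2, 3, requires_grad=True)
A = torch.randn(4, 2)

Y = A @ X                       # Y has shape (4, 3)
Z = (Y ** 2).sum()              # scalar function of Y
Z.backward()                    # X.grad = dZ/dX, shape (2, 3)

dZ_dY = (2 * Y).detach()                          # dZ/dY, shape (4, 3)
dY_dX = jacobian(lambda X: A @ X, X.detach())     # 4th-order tensor
manual = torch.einsum("kl,klij->ij", dZ_dY, dY_dX)
print(dY_dX.shape)                                # torch.Size([4, 3, 2, 3])
print(torch.allclose(manual, X.grad))             # True
```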
So far we have only defined these objects; we have not yet actually computed the gradient of a function defined on a matrix space, nor derived the backpropagation algorithm in deep learning. We first recall the rules for calculating differential forms.
Rules for calculating differential forms
The differential operator $d$ has a strict definition in differential geometry, but in practice we can use the following rules to calculate differential forms.
$$
\begin{aligned}
d(X\pm Y) &= dX \pm dY \\
d(XY) &= (dX)\,Y + X\,dY \\
d(X^\top) &= (dX)^\top \\
d(\operatorname{tr}(X)) &= \operatorname{tr}(dX)
\end{aligned}
$$
Note that the order of the factors matters in the product rule, since matrix multiplication does not commute.
If X is invertible, then we have
$$
d(X^{-1}) = -X^{-1}\,dX\,X^{-1}, \qquad d(\det(X)) = \det(X)\operatorname{tr}(X^{-1}\,dX)
$$
Element-wise operations:
$$
d(X\odot Y) = dX\odot Y + X\odot dY, \qquad d(X\oslash Y) = \frac{dX\odot Y - X\odot dY}{Y\odot Y}
$$
Element-wise functions:
$$
d\sigma(X) = \sigma'(X)\odot dX
$$
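These rules are easy to check numerically. For instance, here is a minimal sketch (random matrices, sizes arbitrary) verifying the matrix product rule $d(XY) = (dX)\,Y + X\,dY$ to first order:

```python
import torch

torch.manual_seed(0)
X, Y = torch.randn(3, 4), torch.randn(4, 2)
dX, dY = 1e-6 * torch.randn(3, 4), 1e-6 * torch.randn(4, 2)

actual = (X + dX) @ (Y + dY) - X @ Y   # exact change in the product
first_order = dX @ Y + X @ dY          # the differential d(XY)
# The residual is dX @ dY, i.e. second order in the perturbation.
print((actual - first_order).abs().max().item())
```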
Gradient calculation in practice
The examples in this section are taken from the Zhihu article, which I strongly recommend for more exercises!
Given $df$, we want to calculate $\frac{\partial f}{\partial X}$. We can write $df$ in the form $df = \operatorname{tr}\left(\left(\frac{\partial f}{\partial X}\right)^\top dX\right)$. To do this, we need some tricks to calculate the trace.
Tricks for calculating traces
- Scalar rule: if $a$ is a scalar, then $\operatorname{tr}(a) = a$.
- Transpose rule: $\operatorname{tr}(A) = \operatorname{tr}(A^\top)$.
- Linearity: $\operatorname{tr}(A+B) = \operatorname{tr}(A) + \operatorname{tr}(B)$.
- Cyclic rule: $\operatorname{tr}(AB) = \operatorname{tr}(BA)$.
- Product rule: $\operatorname{tr}\big(A^\top(B\odot C)\big) = \operatorname{tr}\big((A\odot B)^\top C\big)$.
Frobenius product: Let $A$ and $B$ be two matrices of the same shape. The Frobenius product of $A$ and $B$ is defined as
$$
\langle A, B\rangle_F = \operatorname{tr}(A^\top B) = \sum_{i,j} A_{i,j} B_{i,j}
$$
This is a generalization of the dot product for matrices.
Examples
- $f = a^\top X b$, where $a$ is an $(m\times 1)$ vector, $b$ is an $(n\times 1)$ vector, and $X$ is an $(m\times n)$ matrix. Then we have
$$
df = (da)^\top X b + a^\top\, dX\, b + a^\top X\, db = a^\top\, dX\, b = \operatorname{tr}(a^\top\, dX\, b) = \operatorname{tr}(b\, a^\top\, dX) = \operatorname{tr}\big((a b^\top)^\top dX\big)
$$
(the terms in $da$ and $db$ vanish since $a$ and $b$ are constants). Thus we have
$$
\frac{\partial f}{\partial X} = a b^\top
$$
Note that we cannot write $\frac{\partial f}{\partial X} = a^\top \frac{\partial X}{\partial X} b$, since the partial derivative does not commute with matrix multiplication.
- $f = a^\top \exp(Xb)$, where $a$ is an $(m\times 1)$ vector, $b$ is an $(n\times 1)$ vector, $X$ is an $(m\times n)$ matrix, and $\exp$ is applied element-wise. Then we have
$$
\begin{aligned}
df &= (da)^\top \exp(Xb) + a^\top\, d\big(\exp(Xb)\big) = a^\top\big(\exp(Xb)\odot d(Xb)\big) \\
&= \operatorname{tr}\Big(a^\top\big(\exp(Xb)\odot d(Xb)\big)\Big) = \operatorname{tr}\Big(\big(a\odot \exp(Xb)\big)^\top d(Xb)\Big) \\
&= \operatorname{tr}\Big(b\,\big(a\odot \exp(Xb)\big)^\top dX\Big) = \operatorname{tr}\Big(\big(\big(a\odot \exp(Xb)\big)\, b^\top\big)^\top dX\Big)
\end{aligned}
$$
Thus we have $\frac{\partial f}{\partial X} = \big(a\odot \exp(Xb)\big)\, b^\top$.
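Both closed-form gradients can be checked against PyTorch autograd; the sketch below verifies $\partial f/\partial X = ab^\top$ for the first example and $(a\odot \exp(Xb))\,b^\top$ for the second:

```python
import torch

torch.manual_seed(0)
m, n = 4, 3
a, b = torch.randn(m, 1), torch.randn(n, 1)
X = torch.randn(m, n, requires_grad=True)

# Example 1: f = a^T X b, expected gradient a b^T
f1 = (a.T @ X @ b).squeeze()
(g1,) = torch.autograd.grad(f1, X)
print(torch.allclose(g1, a @ b.T))                       # True

# Example 2: f = a^T exp(Xb), expected gradient (a ⊙ exp(Xb)) b^T
f2 = (a.T @ torch.exp(X @ b)).squeeze()
(g2,) = torch.autograd.grad(f2, X)
print(torch.allclose(g2, (a * torch.exp(X @ b)) @ b.T))  # True
```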
- (Backpropagation) $Y = f(X)$, and $L(Y)$ is a scalar function of $Y$. We have already calculated $\frac{\partial L}{\partial Y}$.
We want to calculate $\frac{\partial L}{\partial X}$.
We have
$$
dL = \operatorname{tr}\left(\left(\frac{\partial L}{\partial Y}\right)^\top dY\right) = \operatorname{tr}\left(\left(\frac{\partial L}{\partial Y}\right)^\top df(X)\right)
$$
Meanwhile, by definition,
$$
dL = \operatorname{tr}\left(\left(\frac{\partial L}{\partial X}\right)^\top dX\right)
$$
By expanding $df(X)$ with the differential rules and rearranging the first expression into the form $\operatorname{tr}\big((\cdot)^\top dX\big)$, we can read off $\frac{\partial L}{\partial X}$.
Application in deep learning 1: Neural Network with two layers
Let $X$ be the input matrix of shape $(N\times D)$, where $N$ is the number of samples and $D$ is the number of features, and let $y$ be the label vector of shape $(N,)$, where $y_i \in \{0, 1, \ldots, C-1\}$ is the class of the $i$-th sample.
Consider a 2-layer neural network with the following architecture:
$$
\begin{aligned}
Z_1 &= X W_1 + b_1 \\
H_1 &= \sigma(Z_1) \\
Z_2 &= H_1 W_2 + b_2 \\
P &= \operatorname{Softmax}(Z_2)
\end{aligned}
$$
where
- $X$ is the input matrix of shape $(N\times D)$,
- $W_1$ is the weight matrix of the first layer, of shape $(D\times H)$,
- $b_1$ is the bias row vector of the first layer, of shape $(H)$,
- $W_2$ is the weight matrix of the second layer, of shape $(H\times C)$,
- $b_2$ is the bias row vector of the second layer, of shape $(C)$,
- $Z_1$ is the output of the first layer, of shape $(N\times H)$,
- $H_1$ is the activation of the first layer, of shape $(N\times H)$,
- $Z_2$ is the output of the second layer, of shape $(N\times C)$,
- $P$ is the output of the network, of shape $(N\times C)$,
- $Y$ is the one-hot encoded target matrix of shape $(N\times C)$, with $Y_{i,c} = \delta_{c, y_i}$.
In other words, we have the following architecture:
$$
P = \operatorname{Softmax}\big(\sigma(X W_1 + b_1)\, W_2 + b_2\big)
$$
We use the cross-entropy loss function:
$$
L = -\frac{1}{N}\sum_{i=1}^N \log P_{i, y_i} = -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^C Y_{i,j}\log P_{i,j} = -\frac{1}{N}\operatorname{tr}\big(Y^\top \log P\big)
$$
We leave it to the reader to verify that $L$ is, up to the factor $-\frac{1}{N}$, the Frobenius product of the target matrix $Y$ and the element-wise logarithm of the softmax output $P$.
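As a quick sanity check, here is a small sketch verifying that the trace form of the loss matches PyTorch's built-in cross-entropy on random logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 5, 4
Z2 = torch.randn(N, C)
y = torch.randint(0, C, (N,))

P = torch.softmax(Z2, dim=1)
Y = F.one_hot(y, num_classes=C).to(P.dtype)
loss_trace = -torch.trace(Y.T @ torch.log(P)) / N
print(torch.allclose(loss_trace, F.cross_entropy(Z2, y)))  # True
```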
$\partial L/\partial Z_2$
This is the famous cross-entropy loss on the softmax output. Let's use the differential form of the loss function.
$$
dL = -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^C \left(Y_{i,j}\,\frac{1}{P_{i,j}}\right) dP_{i,j}
$$
Let us expand the differential $dP_{i,j}$. Since the softmax is a row-wise operation, we have
$$
dP_{i,j} = d\operatorname{Softmax}(Z_2)_{i,j} = \sum_{k=1}^C \frac{\partial P_{i,j}}{\partial Z_{2,i,k}}\, dZ_{2,i,k}
$$
Meanwhile,
$$
\begin{aligned}
\frac{\partial P_{i,j}}{\partial Z_{2,i,k}}
&= \frac{\partial}{\partial Z_{2,i,k}}\left(\frac{\exp(Z_{2,i,j})}{\sum_{l=1}^C \exp(Z_{2,i,l})}\right) \\
&= \frac{\exp(Z_{2,i,j})}{\sum_{l=1}^C \exp(Z_{2,i,l})}\,\delta_{j,k} - \frac{\exp(Z_{2,i,j})\exp(Z_{2,i,k})}{\left(\sum_{l=1}^C \exp(Z_{2,i,l})\right)^2} \\
&= P_{i,j}\,\delta_{j,k} - P_{i,j} P_{i,k} = P_{i,j}\,(\delta_{j,k} - P_{i,k})
\end{aligned}
$$
Substituting into the expression for dL:
$$
\begin{aligned}
dL &= -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^C \left(Y_{i,j}\,\frac{1}{P_{i,j}}\right)\sum_{k=1}^C P_{i,j}\,(\delta_{j,k} - P_{i,k})\, dZ_{2,i,k} \\
&= -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^C\sum_{k=1}^C Y_{i,j}\,(\delta_{j,k} - P_{i,k})\, dZ_{2,i,k} \\
&= -\frac{1}{N}\sum_{i=1}^N\sum_{k=1}^C \left(\sum_{j=1}^C Y_{i,j}\,(\delta_{j,k} - P_{i,k})\right) dZ_{2,i,k}
\end{aligned}
$$
Focus on the inner sum:
- The $\delta_{j,k}$ factor picks out the term $j = k$, which contributes $Y_{i,k}$.
- The second term sums over $j$ as $-P_{i,k}\sum_{j=1}^C Y_{i,j}$.
Since $\sum_{j=1}^C Y_{i,j} = 1$ (exactly one class is true for each sample):
$$
dL = -\frac{1}{N}\sum_{i=1}^N\sum_{k=1}^C (Y_{i,k} - P_{i,k})\, dZ_{2,i,k}
$$
Then we can vectorize this calculation.
$$
dL = -\frac{1}{N}\operatorname{tr}\big((Y - P)^\top dZ_2\big) = \operatorname{tr}\left(\left(\frac{P - Y}{N}\right)^\top dZ_2\right)
$$
We have
$$
\frac{\partial L}{\partial Z_2} = \frac{1}{N}(P - Y)
$$
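We can verify this key result against autograd; a minimal sketch on random logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 5, 4
Z2 = torch.randn(N, C, requires_grad=True)
y = torch.randint(0, C, (N,))

F.cross_entropy(Z2, y).backward()   # mean cross-entropy, as in the text

P = torch.softmax(Z2.detach(), dim=1)
Y = F.one_hot(y, num_classes=C).to(P.dtype)
print(torch.allclose(Z2.grad, (P - Y) / N))  # True
```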
$\partial L/\partial W_2$ and $\partial L/\partial b_2$
$$
dZ_2 = d(H_1 W_2 + b_2) = (dH_1)\, W_2 + H_1\, dW_2 + db_2
$$
Then
$$
dL = \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^\top \big((dH_1)\, W_2 + H_1\, dW_2 + db_2\big)\right)
= \underbrace{\operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^\top (dH_1)\, W_2\right)}_{dL_1}
+ \underbrace{\operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^\top H_1\, dW_2\right)}_{dL_2}
+ \underbrace{\operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^\top db_2\right)}_{dL_3}
$$
Since the first term and the third term are independent of dW2, we obtain
$$
\frac{\partial L}{\partial W_2} = \frac{\partial L_2}{\partial W_2} = H_1^\top \frac{\partial L}{\partial Z_2}
$$
For the gradient of $b_2$, we have to be careful: since $b_2$ is broadcast over the rows, we have
$$
dL_3 = \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^\top d(\mathbf{1}_N b_2)\right) = \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^\top \mathbf{1}_N\, db_2\right) = \operatorname{tr}\left(\left(\mathbf{1}_N^\top \frac{\partial L}{\partial Z_2}\right)^\top db_2\right)
$$
where $\mathbf{1}_N$ is a column vector of ones of shape $(N, 1)$.
Thus we have
$$
\frac{\partial L}{\partial b_2} = \mathbf{1}_N^\top \frac{\partial L}{\partial Z_2} = \frac{1}{N}\sum_{i=1}^N (P_{i,:} - Y_{i,:})
$$
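In code, multiplying by $\mathbf{1}_N^\top$ on the left is simply a sum over the batch dimension; a one-line check:

```python
import torch

G = torch.randn(5, 3)            # stands in for dL/dZ2
ones = torch.ones(5, 1)          # the column vector 1_N
print(torch.allclose((ones.T @ G).squeeze(0), G.sum(dim=0)))  # True
```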
$\partial L/\partial H_1$
We repeat the same calculation for $dL_1$, which involves $dH_1$.
$$
dL_1 = \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\right)^\top (dH_1)\, W_2\right)
= \operatorname{tr}\left(W_2\left(\frac{\partial L}{\partial Z_2}\right)^\top dH_1\right)
= \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_2}\, W_2^\top\right)^\top dH_1\right)
$$
Thus we have
$$
\frac{\partial L}{\partial H_1} = \frac{\partial L}{\partial Z_2}\, W_2^\top
$$
$\partial L/\partial Z_1$
Repeating the same technique:
$$
dL_1 = \operatorname{tr}\left(\left(\frac{\partial L}{\partial H_1}\right)^\top dH_1\right)
= \operatorname{tr}\left(\left(\frac{\partial L}{\partial H_1}\right)^\top \big(\sigma'(Z_1)\odot dZ_1\big)\right)
= \operatorname{tr}\left(\left(\frac{\partial L}{\partial H_1}\odot \sigma'(Z_1)\right)^\top dZ_1\right)
$$
Thus we have
$$
\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial H_1}\odot \sigma'(Z_1)
$$
$\partial L/\partial W_1$ and $\partial L/\partial b_1$
Repeating once more (the $dX$ term is dropped, since the input $X$ is fixed):
$$
\begin{aligned}
dL &= \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_1}\right)^\top dZ_1\right)
= \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_1}\right)^\top \big((dX)\, W_1 + X\, dW_1 + d(\mathbf{1}_N b_1)\big)\right) \\
&= \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_1}\right)^\top X\, dW_1\right) + \operatorname{tr}\left(\left(\frac{\partial L}{\partial Z_1}\right)^\top \mathbf{1}_N\, db_1\right)
\end{aligned}
$$
Then finally we have
$$
\frac{\partial L}{\partial W_1} = X^\top \frac{\partial L}{\partial Z_1}, \qquad
\frac{\partial L}{\partial b_1} = \mathbf{1}_N^\top \frac{\partial L}{\partial Z_1}
$$
Summary
The regularized loss function is
$$
L = -\frac{1}{N}\operatorname{tr}(Y^\top \log P) + \lambda\left(\sum_{i,j} W_{1,i,j}^2 + \sum_{i,j} W_{2,i,j}^2\right)
$$
$$
\begin{aligned}
\frac{\partial L}{\partial Z_2} &= \frac{1}{N}(P - Y) \\
\frac{\partial L}{\partial W_2} &= H_1^\top \frac{\partial L}{\partial Z_2} + 2\lambda W_2 \\
\frac{\partial L}{\partial b_2} &= \mathbf{1}_N^\top \frac{\partial L}{\partial Z_2} \\
\frac{\partial L}{\partial H_1} &= \frac{\partial L}{\partial Z_2}\, W_2^\top \\
\frac{\partial L}{\partial Z_1} &= \frac{\partial L}{\partial H_1}\odot \sigma'(Z_1) \\
\frac{\partial L}{\partial W_1} &= X^\top \frac{\partial L}{\partial Z_1} + 2\lambda W_1 \\
\frac{\partial L}{\partial b_1} &= \mathbf{1}_N^\top \frac{\partial L}{\partial Z_1}
\end{aligned}
$$
We can observe that the gradient of each layer depends on the gradient of the next layer. This is the principle of backpropagation in deep learning.
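Before turning to the assignment implementation, here is a self-contained sketch (assuming $\sigma = \mathrm{ReLU}$ and the regularized loss above; all dimensions are arbitrary) that checks the summary formulas against PyTorch autograd:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, H, C, lam = 6, 5, 4, 3, 0.1
X = torch.randn(N, D)
y = torch.randint(0, C, (N,))
W1 = torch.randn(D, H, requires_grad=True)
b1 = torch.randn(H, requires_grad=True)
W2 = torch.randn(H, C, requires_grad=True)
b2 = torch.randn(C, requires_grad=True)

Z1 = X @ W1 + b1
H1 = Z1.clamp(min=0)                        # sigma = ReLU
Z2 = H1 @ W2 + b2
loss = F.cross_entropy(Z2, y) + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())
loss.backward()

# Manual gradients, straight from the summary formulas.
P = torch.softmax(Z2.detach(), dim=1)
Y = F.one_hot(y, num_classes=C).to(P.dtype)
H1d, Z1d = H1.detach(), Z1.detach()
gZ2 = (P - Y) / N
gW2 = H1d.T @ gZ2 + 2 * lam * W2.detach()
gb2 = gZ2.sum(dim=0)                        # 1_N^T gZ2
gH1 = gZ2 @ W2.detach().T
gZ1 = gH1 * (Z1d > 0)                       # sigma'(Z1) for ReLU
gW1 = X.T @ gZ1 + 2 * lam * W1.detach()
gb1 = gZ1.sum(dim=0)                        # 1_N^T gZ1

for manual, auto in [(gW1, W1.grad), (gb1, b1.grad),
                     (gW2, W2.grad), (gb2, b2.grad)]:
    print(torch.allclose(manual, auto, atol=1e-6))  # True four times
```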
Implementation
```python
# @Credits: UMich EECS 498-007 / 598-005 Deep Learning for Computer Vision
# Solution is my own implementation of the forward and backward pass of a two-layer neural network
from typing import Dict, Optional

import torch


def nn_forward_backward(
    params: Dict[str, torch.Tensor],
    X: torch.Tensor,
    y: Optional[torch.Tensor] = None,
    reg: float = 0.0,
):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network. When you implement loss and gradient, please don't forget to
    scale the losses/gradients by the batch size.

    Inputs: First two parameters (params, X) are same as nn_forward_pass
    - params: a dictionary of PyTorch Tensor that store the weights of a model.
      It should have following keys with shape
          W1: First layer weights; has shape (D, H)
          b1: First layer biases; has shape (H,)
          W2: Second layer weights; has shape (H, C)
          b2: Second layer biases; has shape (C,)
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.

    Returns:
    If y is None, return a tensor scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].

    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = params["W1"], params["b1"]
    W2, b2 = params["W2"], params["b2"]
    N, D = X.shape

    # nn_forward_pass (defined earlier in the assignment) returns the class
    # scores Z2 of shape (N, C) and the hidden activations H1 of shape (N, H).
    scores, h1 = nn_forward_pass(params, X)
    # If the targets are not given then jump out, we're done
    if y is None:
        return scores

    # Compute the loss: numerically stable softmax, then mean cross-entropy
    normalized_scores = scores - scores.max(dim=1, keepdim=True).values
    torch.exp(normalized_scores, out=normalized_scores)
    normalized_scores /= normalized_scores.sum(dim=1, keepdim=True)  # = P
    loss = torch.mean(-torch.log(normalized_scores[torch.arange(N), y]))
    loss += reg * (torch.sum(W1 * W1) + torch.sum(W2 * W2))

    # Backward pass: compute gradients
    grads = {}
    grad_z2 = normalized_scores          # dL/dZ2 = (P - Y) / N ...
    grad_z2[torch.arange(N), y] -= 1     # ... subtract 1 for the correct class
    grad_z2 /= N                         # ... and scale by the batch size
    grads["W2"] = h1.T @ grad_z2 + 2 * reg * W2  # dL/dW2 = H1^T dL/dZ2 + 2λW2
    grads["b2"] = grad_z2.sum(dim=0)             # dL/db2 = 1_N^T dL/dZ2
    # Compute gradient with respect to W1, b1
    grad_h1 = grad_z2 @ W2.T             # dL/dH1 = dL/dZ2 W2^T
    grad_z1 = grad_h1 * (h1 > 0)         # dL/dZ1 = dL/dH1 ⊙ σ'(Z1), σ = ReLU
    grads["W1"] = X.T @ grad_z1 + 2 * reg * W1   # dL/dW1 = X^T dL/dZ1 + 2λW1
    grads["b1"] = grad_z1.sum(dim=0)             # dL/db1 = 1_N^T dL/dZ1
    return loss, grads
```
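The assignment's `nn_forward_pass` is not reproduced here. As a hypothetical stand-in consistent with how it is used above (ReLU hidden layer, affine output scores), the following snippet makes the function runnable and shows a quick usage example:

```python
import torch

# Hypothetical stand-in for the assignment's nn_forward_pass: it must return
# the class scores Z2 and the hidden ReLU activations H1.
def nn_forward_pass(params, X):
    h1 = (X @ params["W1"] + params["b1"]).clamp(min=0)
    scores = h1 @ params["W2"] + params["b2"]
    return scores, h1

torch.manual_seed(0)
params = {
    "W1": 1e-2 * torch.randn(5, 4), "b1": torch.zeros(4),
    "W2": 1e-2 * torch.randn(4, 3), "b2": torch.zeros(3),
}
X, y = torch.randn(8, 5), torch.randint(0, 3, (8,))
loss, grads = nn_forward_backward(params, X, y, reg=0.1)
print(loss.item(), {k: tuple(v.shape) for k, v in grads.items()})
```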