In this post, we’ll walk through how to compute the Jacobian of a matrix product. We’ll start with a small example and then generalize to larger matrices. Because these sorts of Jacobians can be large and unwieldy, it’s helpful to visualize the derivation to understand the structure.
Simple Example: 2x2 Matrix Product
Problem Setup
We’ll start with a very simple 2x2 matrix product:
Let
$$
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},
\qquad
B = \begin{pmatrix} w & x \\ y & z \end{pmatrix}.
$$
We want the product
$$
C = AB
= \begin{pmatrix} a & b \\ c & d \end{pmatrix}
\begin{pmatrix} w & x \\ y & z \end{pmatrix}
= \begin{pmatrix} aw + by & ax + bz \\ cw + dy & cx + dz \end{pmatrix}.
$$
Denote the entries of C by
$$
C = \begin{pmatrix} c_1 & c_2 \\ c_3 & c_4 \end{pmatrix}
= \begin{pmatrix} aw + by & ax + bz \\ cw + dy & cx + dz \end{pmatrix}.
$$
We will treat C as a function of the entries of A, holding B fixed; in other words, we are computing the Jacobian with respect to A. We want to understand the Jacobian of
$$c = (c_1, c_2, c_3, c_4)$$
with respect to
$$a = (a, b, c, d).$$
Write Each $c_i$ in Terms of $(a, b, c, d)$
From the product C=AB, we identify:
$$
\begin{aligned}
c_1 &= aw + by \\
c_2 &= ax + bz \\
c_3 &= cw + dy \\
c_4 &= cx + dz
\end{aligned}
$$
Compute Partial Derivatives
We compute all partial derivatives $\partial c_i / \partial \alpha$ for $\alpha \in \{a, b, c, d\}$ and $i = 1, 2, 3, 4$.
For $c_1 = aw + by$:

$$\frac{\partial c_1}{\partial a} = w, \quad \frac{\partial c_1}{\partial b} = y, \quad \frac{\partial c_1}{\partial c} = 0, \quad \frac{\partial c_1}{\partial d} = 0.$$

For $c_2 = ax + bz$:

$$\frac{\partial c_2}{\partial a} = x, \quad \frac{\partial c_2}{\partial b} = z, \quad \frac{\partial c_2}{\partial c} = 0, \quad \frac{\partial c_2}{\partial d} = 0.$$

For $c_3 = cw + dy$:

$$\frac{\partial c_3}{\partial a} = 0, \quad \frac{\partial c_3}{\partial b} = 0, \quad \frac{\partial c_3}{\partial c} = w, \quad \frac{\partial c_3}{\partial d} = y.$$

For $c_4 = cx + dz$:

$$\frac{\partial c_4}{\partial a} = 0, \quad \frac{\partial c_4}{\partial b} = 0, \quad \frac{\partial c_4}{\partial c} = x, \quad \frac{\partial c_4}{\partial d} = z.$$
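The post itself contains no code, but the closed-form partials above are easy to sanity-check numerically. Here is a minimal sketch (assuming NumPy, with arbitrary numeric values for the entries of B) that compares central finite differences against the pattern of $w, x, y, z$ and zeros:

```python
import numpy as np

# Arbitrary values for B's entries (an assumption for illustration).
w_, x_, y_, z_ = 2.0, 3.0, 5.0, 7.0
B = np.array([[w_, x_], [y_, z_]])

def c_of_a(avec):
    # avec = (a, b, c, d) -> (c1, c2, c3, c4), i.e. AB flattened row by row
    A = avec.reshape(2, 2)
    return (A @ B).reshape(-1)

a0 = np.array([1.0, 2.0, 3.0, 4.0])
eps = 1e-6
# Each comprehension row is dC/da_r; transposing puts outputs on rows,
# parameters on columns, matching the convention in the text.
J = np.array([(c_of_a(a0 + eps * e) - c_of_a(a0 - eps * e)) / (2 * eps)
              for e in np.eye(4)]).T

expected = np.array([[w_, y_, 0, 0],
                     [x_, z_, 0, 0],
                     [0, 0, w_, y_],
                     [0, 0, x_, z_]])
print(np.allclose(J, expected))  # -> True
```

Central differences are exact here (up to rounding) because each $c_i$ is linear in $(a, b, c, d)$.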
Arrange These Partials into a Jacobian Matrix
When we speak of “the Jacobian,” we typically stack the outputs c as rows (or as one long column) and do the same with the parameters a. In one common convention (outputs as rows, parameters as columns), we get the $4 \times 4$ matrix

$$
J = \begin{pmatrix}
w & y & 0 & 0 \\
x & z & 0 & 0 \\
0 & 0 & w & y \\
0 & 0 & x & z
\end{pmatrix}.
$$

This is the Jacobian of $(c_1, c_2, c_3, c_4)$ with respect to $(a, b, c, d)$.
Index-wise, you can think of $c_1, c_2, c_3, c_4$ as $c_{1,1}, c_{1,2}, c_{2,1}, c_{2,2}$ (i.e., the first row of $C$ is $(c_1, c_2)$ and the second row is $(c_3, c_4)$). Similarly, $(a, b, c, d)$ can be mapped to $(a_{1,1}, a_{1,2}, a_{2,1}, a_{2,2})$.
Hence, each row index of $J$ can be viewed as $(i, j) \in \{1, 2\} \times \{1, 2\}$ for $c_{i,j}$, and each column index of $J$ can be viewed as $(k, \ell) \in \{1, 2\} \times \{1, 2\}$ for $a_{k,\ell}$. So $J$ can be written as:

$$J_{(i,j),(k,\ell)} = \frac{\partial c_{i,j}}{\partial a_{k,\ell}}.$$
In this sense, the single matrix $J$ can be reshaped into a 4D array of shape $(2, 2, 2, 2)$, indexed by $(i, j, k, \ell)$.
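As a small illustration of this reshape (a sketch, with arbitrary numeric values for $w, x, y, z$; not from the original post), NumPy recovers the 4-index view directly:

```python
import numpy as np

# Arbitrary values for B's entries, matching the closed-form Jacobian pattern.
w, x, y, z = 2.0, 3.0, 5.0, 7.0
J = np.array([[w, y, 0, 0],
              [x, z, 0, 0],
              [0, 0, w, y],
              [0, 0, x, z]])

# Row index of J enumerates (i, j) of c_{i,j}; column index enumerates (k, l)
# of a_{k,l}; reshaping splits each flat index back into its pair.
J4 = J.reshape(2, 2, 2, 2)

# e.g. d c_{1,1} / d a_{1,2} = y  (zero-based indices [0, 0, 0, 1])
print(J4[0, 0, 0, 1])  # -> 5.0
```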
Visualizing the Pattern
Even for larger matrices, you can see the pattern:
The top-left block depends on the first row-block of A (i.e., a,b) and the relevant columns of B.
The bottom-right block depends on the second row-block of A (i.e., c,d) and the relevant columns of B.
Zeros appear outside the blocks corresponding to the relevant multiplication.
Another way to see this is via vec-operator identities (if you flatten each matrix into a long column vector). For instance, the well-known identity
$$\operatorname{vec}(AB) = (B^T \otimes I)\,\operatorname{vec}(A)$$
tells you exactly how to build the Jacobian by Kronecker products. In the $2 \times 2$ case, $B^T$ is

$$B^T = \begin{pmatrix} w & y \\ x & z \end{pmatrix},$$

and $I$ is the $2 \times 2$ identity matrix. Then

$$
B^T \otimes I = \begin{pmatrix} wI & yI \\ xI & zI \end{pmatrix}
= \begin{pmatrix}
w & 0 & y & 0 \\
0 & w & 0 & y \\
x & 0 & z & 0 \\
0 & x & 0 & z
\end{pmatrix},
$$
which matches the same structure we saw by direct partial derivatives (just with a slightly different row/column ordering convention depending on how exactly you flatten).
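The identity itself is easy to verify numerically. A minimal sketch with NumPy (random matrices; note that $\operatorname{vec}$ stacks columns, which in NumPy's row-major world is `flatten(order="F")`):

```python
import numpy as np

# Check vec(AB) = (B^T kron I) vec(A) on arbitrary 2x2 matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2))

vec = lambda M: M.flatten(order="F")   # vec stacks columns
lhs = vec(A @ B)
rhs = np.kron(B.T, np.eye(2)) @ vec(A)
print(np.allclose(lhs, rhs))  # -> True
```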
Generalization
Suppose Y=AB where
$A$ is an $m \times p$ matrix (entries $A_{ij}$),
$B$ is a $p \times n$ matrix,
$Y$ is then an $m \times n$ matrix with entries $Y_{ik}$.
A very direct way to express the partial derivatives componentwise is to write $Y_{ik} = \sum_{j} A_{ij} B_{jk}$ and notice:

$\partial Y_{ik} / \partial A_{\alpha\beta} = 0$ unless $i = \alpha$. When $i = \alpha$, the only term in the sum that involves $A_{\alpha\beta}$ is the one with $j = \beta$, so the derivative equals $B_{\beta,k}$.
Often, this is written using an indicator (or Iverson bracket) as:
$$\frac{\partial Y_{ik}}{\partial A_{\alpha\beta}} = \mathbf{1}\{i = \alpha\}\, B_{\beta,k}.$$
Another common notation uses the Kronecker delta δ. We can compactly write:
$$\frac{\partial (AB)_{ik}}{\partial A_{\alpha\beta}} = \delta_{i,\alpha}\, B_{\beta,k}.$$
In vectorized form (using Kronecker products)
When working with gradients or Jacobians involving many matrix entries at once, it is typical to vectorize the matrices. Recall that
$$\operatorname{vec}(Y) = \operatorname{vec}(AB) = (B^T \otimes I_m)\,\operatorname{vec}(A),$$
where
$\operatorname{vec}(\cdot)$ stacks the columns of a matrix into a single column vector,
$\otimes$ denotes the Kronecker product,
$I_m$ is the $m \times m$ identity matrix,
$B^T$ is the transpose of $B$.
From that identity, you can read off that the total Jacobian matrix (of size $mn \times mp$) that maps $\operatorname{vec}(A)$ to $\operatorname{vec}(Y)$ is precisely:

$$\frac{\partial \operatorname{vec}(Y)}{\partial \operatorname{vec}(A)} = B^T \otimes I_m.$$
That is a “one-shot” formula: once you memorize
$$\operatorname{vec}(AB) = (B^T \otimes I)\,\operatorname{vec}(A),$$
you effectively know the Jacobian as well.
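As a quick check of the one-shot formula for non-square shapes, here is a sketch (arbitrary shapes and random entries, assuming NumPy):

```python
import numpy as np

# The (mn x mp) Jacobian mapping vec(A) to vec(Y) should equal B^T kron I_m.
# Shapes m, p, n are arbitrary choices for illustration.
m, p, n = 3, 4, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, p))
B = rng.standard_normal((p, n))

vec = lambda M: M.flatten(order="F")   # vec stacks columns
J = np.kron(B.T, np.eye(m))            # shape (m*n, m*p)

print(J.shape)                              # -> (6, 12)
print(np.allclose(J @ vec(A), vec(A @ B)))  # -> True
```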
For an $m \times p$ matrix $A$ and a $p \times n$ matrix $B$, the product $C = AB$ is $m \times n$.
Flattening $A$ into an $mp \times 1$ vector and $C$ into an $mn \times 1$ vector, the Jacobian $\partial \operatorname{vec}(C) / \partial \operatorname{vec}(A)$ is a block matrix of size $mn \times mp$, which can be written explicitly as:

$$B^T \otimes I_m \quad \text{(using the Kronecker product)}.$$

Each partial derivative $\partial C_{ij} / \partial A_{k\ell}$ is nonzero only when $i = k$, in which case it equals $B_{\ell j}$. That is the “delta” structure you see in the small example.
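Finally, the componentwise delta structure can be checked by brute force. A sketch (arbitrary shapes, assuming NumPy; the central differences are exact up to rounding since $C$ is linear in $A$):

```python
import numpy as np

# Check dC_{ij}/dA_{kl} = delta_{ik} B_{lj} entry by entry.
m, p, n = 3, 4, 2
rng = np.random.default_rng(1)
A = rng.standard_normal((m, p))
B = rng.standard_normal((p, n))
eps = 1e-6

ok = True
for i in range(m):
    for j in range(n):
        for k in range(m):
            for l in range(p):
                E = np.zeros((m, p)); E[k, l] = 1.0   # perturb only A_{kl}
                num = (((A + eps * E) @ B)[i, j]
                       - ((A - eps * E) @ B)[i, j]) / (2 * eps)
                closed = B[l, j] if i == k else 0.0
                ok = ok and abs(num - closed) < 1e-6
print(ok)  # -> True
```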