AutoDiff Puzzles


My solutions to srush’s AutoDiff Puzzles. This is useful as a quick refresher for computing gradients.

Introduction

From the puzzles’ intro:

Deep learning libraries like Torch utilize autodifferentiation of tensors to compute the parameter updates necessary to learn complex models from data. This technique is central to understanding why deep learning has become so widely used and effective. The autodifferentiation process is a neat trick that builds up a computational graph and then uses that graph to provide derivatives for user-constructed mathematical functions. At heart, this process is just an instantiation of the chain-rule based on implementations of every function and its derivative.

However, a library still needs to have efficient implementations of derivatives for its key building blocks. This sounds trivial: just implement high school calculus. However, this is a bit trickier than it sounds. Tensor-to-tensor functions are pretty complex and require keeping careful track of indexing on both the input and the output side.

Your goal in these puzzles is to implement the Jacobian, i.e. the derivative of each output cell with respect to each input cell, for each function below. In each case the function takes in a tensor $x \in \mathbb{R}^{I}$ and returns a tensor $f(x) \in \mathbb{R}^{O}$, so your job is to compute $\frac{d f(x)_o}{dx_i}$ for all indices $o \in \{0 \ldots O-1\}$ and $i \in \{0 \ldots I-1\}$. If you get discouraged, just remember, you did this in high school (it just had way less indexing).
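
To make the indexing concrete, here is a small worked example that is not one of the puzzles: for $f(x) = [x_0 x_1,\ x_0 + x_1]$ we have $I = O = 2$, and the Jacobian is

$$\frac{d f(x)_o}{dx_i} = \begin{bmatrix} x_1 & x_0 \\ 1 & 1 \end{bmatrix},$$

where row $o$ indexes outputs and column $i$ indexes inputs.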

Rules and Tips

For each of the problems, a tensor function $f$ is provided. There are two solution formats:

  1. Given a lambda called f, your job is to fill in the function dx(i, o) which provides $\frac{d f(x)_o}{dx_i}$. These puzzles are about correctness, not efficiency.
  2. Given some input and output tensors Is and Os and a function $f$ with Shaped Array types, fill out a function jac which returns the full Jacobian at a point $x$ (identical to jax.jacfwd(f)(x)); the setup I assume for these snippets is sketched below.
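
The snippets below leave out their imports. A minimal setup that I assume throughout (the exact imports in the original notebook may differ) is:

import jax
import jax.numpy as np  # assumption: `np` is jax.numpy, so jax.jacfwd composes with it
from jaxtyping import Array, Shaped


def check_jac(f, jac, x):
    """Compare a hand-written Jacobian against jax.jacfwd at the point x."""
    return np.allclose(jac(x), jax.jacfwd(f)(x))

The check_jac helper is my own addition for spot-checking solutions; the puzzles come with their own test harness.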

Problem 1: Id

Warmup: $f(x_0) = [x_0]$

Solution: This warmup asks us to compute $\frac{dx}{dx} = 1$. The derivative is 1 since it’s just the identity function. More explicitly, we know that $\frac{df(x)_o}{dx_i}$ is 1 for $o = i = 0$ since $f(x)_0 = x_0$.

def fb_id(x):
  f = lambda o: x[0]
  dx = lambda i, o: 1 # Fill in this line
  return f, dx

Is = np.arange(1)


def f(x: Shaped[Array, "1"]) -> Shaped[Array, "1"]:
    return 2 * x


def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "1 1"]:
    return 2 * (x==x)[None]
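
As a quick sanity check (using the setup sketched above), the hand-written Jacobian can be compared against jax.jacfwd:

x = np.array([3.0])
print(jac(x))                # [[2]]
print(jax.jacfwd(f)(x))      # [[2.]]
print(check_jac(f, jac, x))  # True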

Problem 2: Cosine

Warmup: $f(x_0) = [\cos(x_0)]$

Solution: The derivative of $\cos(x)$ is $-\sin(x)$. So we need to fill in the line dx = lambda i, o: -math.sin(x[0]).

import math

def fb_cos(x):
    f = lambda o: math.cos(x[0])
    dx = lambda i, o: -math.sin(x[0])  # Fill in this line
    return f, dx

def f(x: Shaped[Array, "1"]) -> Shaped[Array, "1"]:
    return np.cos(x)

def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "1 1"]:
    return -np.sin(x)[None]
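
Another way to spot-check a Jacobian, independent of jax.jacfwd, is a central finite difference. A minimal sketch (finite_diff_jac is my own helper, not part of the puzzles):

def finite_diff_jac(f, x, eps=1e-5):
    # Estimate column i of the Jacobian by perturbing x_i up and down.
    cols = []
    for i in range(x.shape[0]):
        e = eps * (np.arange(x.shape[0]) == i)          # one-hot perturbation at index i
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=-1)                      # shape (O, I)


x = np.array([0.5])
print(jac(x))                 # [[-0.479...]]
print(finite_diff_jac(f, x))  # approximately the same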

Problem 3: Mean

$f(x_0, x_1, \ldots, x_{I-1}) = [(x_0 + x_1 + \ldots + x_{I-1}) / I]$

Solution: The Jacobian has shape $1 \times I$ since the output only has one element. $\frac{\partial f(x)_o}{\partial x_i} = \frac{1}{I}$ for all $i$.

def fb_mean(x):
    I = x.shape[0]
    f = lambda o: sum(x[i] for i in range(I)) / I
    dx = lambda i, o: 1 / I  # Fill in this line
    return f, dx

I = 10
Is = np.arange(I)


def f(x: Shaped[Array, "I"]) -> Shaped[Array, "1"]:
    return np.mean(x, axis=0, keepdims=True)


def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "1 I"]:
    return 1 / x.shape[0] * (x == x)[None]
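
The (x == x) comparison is just a trick for building a vector of ones with the right shape, so this Jacobian is the constant row $[1/I, \ldots, 1/I]$. A quick check under the setup assumed above:

x = np.arange(10.0)
print(np.allclose(jac(x), np.full((1, 10), 0.1)))  # True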

Problem 4: Product

$f(x_0, x_1, \ldots, x_{I-1}) = x_0 x_1 \cdots x_{I-1}$

Solution: The Jacobian has shape $1 \times I$ since the output only has one element. For a given $x_i$, the derivative is the product of all the other elements. So we can write this as $\frac{\partial f(x)_o}{\partial x_i} = \prod_{j \neq i} x_j$.

def fb_prod(x):
    pr = torch.prod(x)
    f = lambda o: pr
    dx = lambda i, o: pr / x[i] if x[i] != 0 else 0
    return f, dx

def f(x: Shaped[Array, "I"]) -> Shaped[Array, "1"]:
    return np.prod(x, keepdims=True)


def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "1 I"]:
    pr = f(x)
    return (pr / x)[None]
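
One caveat: both pr / x[i] and (pr / x) silently give the wrong answer when exactly one entry of x is zero, since the true derivative there is the product of the remaining entries, which need not be zero. A sketch of a more careful version (my own addition, not required by the puzzle):

def prod_grad(x):
    # Derivative of prod(x) with respect to each x_i: the product of every other entry.
    grads = []
    for i in range(x.shape[0]):
        others = np.concatenate([x[:i], x[i + 1:]])  # all entries except x_i
        grads.append(np.prod(others))
    return np.stack(grads)[None]                     # shape (1, I)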

Problem 5: Repeat

$f(x_0) = [x_0, x_0, x_0, \ldots, x_0]$

Hint: The function dx should return a scalar. It is the derivative of $f(x_0)_o$, i.e. the $o$’th output.

Solution: The Jacobian has shape $O \times 1$ since the output has $O$ elements and the input only one. The derivative is 1 for every output $o$.

def fb_repeat(x):
    f = lambda o: x[0]
    dx = lambda i, o: 1
    return f, dx

Is = np.arange(1)
O = 10
Os = np.arange(O)[:, None]


def f(x: Shaped[Array, "1"]) -> Shaped[Array, "O"]:
    return (x + (Os * 0 + 1))[:, 0]


def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "O 1"]:
    return (x == x) * (Os * 0 + 1)  # ones, broadcast to shape (O, 1)

Problem 6: Repeat and Scale

$f(x_0) = [x_0 \times 0/I, x_0 \times 1/I, x_0 \times 2/I, \ldots, x_0 \times (I-1)/I]$

Solution: The scalar in front of the repeated version of the input depends on its index. In particular, the $o$-th element is $x_0 \times o/I$. So the derivative is $o/I$.

def fb_repeat_scale(x):
    I = 50
    f = lambda o: x[0] * (o / I)
    dx = lambda i, o: (o / I)
    return f, dx

Is = np.arange(1)
O = 10
Os = np.arange(O)[:, None]


def f(x: Shaped[Array, "1"]) -> Shaped[Array, "O"]:
    return x * (Os / O)[:, 0]


def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "O 1"]:
    return Os / O

Problem 7: Negation

$f(x_0, x_1, \ldots) = [-x_0, -x_1, \ldots]$

Hint: remember the indicator trick, i.e.

(a == b) * 27 # 27 if a == b else 0

Solution: Here, the Jacobian has shape $I \times I$ with $I = O$ since the input and output both have $I$ elements. The Jacobian is a diagonal matrix with $-1$ on the diagonal.

def fb_neg(x):
    f = lambda o: -x[o]
    dx = lambda i, o: -(i == o)
    return f, dx

I = 10
O = 10
Is = np.arange(I)
Os = np.arange(O)[:, None]


def f(x: Shaped[Array, "I"]) -> Shaped[Array, "O"]:
    return -x


def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "O I"]:
    return (0 - (Os == Is[None])).astype(float)
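
Since the output size equals the input size here, the indicator matrix (Os == Is) is just the identity, so the Jacobian is simply -np.eye(I). A quick equivalence check under the setup assumed above:

x = np.arange(10.0)
print(np.allclose(jac(x), -np.eye(10)))  # True (the Jacobian does not depend on x)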

Problem 8: ReLU

$f(x_0, x_1, \ldots) = [\text{relu}(x_0), \text{relu}(x_1), \ldots]$

Recall

$$\text{relu}(x) = \begin{cases} 0 & x < 0 \\ x & x \geq 0 \end{cases}$$

(Note: you can ignore the non-differentiability at 0.)

Solution: ReLU is an element-wise function, so we know the Jacobian is a diagonal matrix. The derivative is 0 if $x_i < 0$ and 1 otherwise.

def fb_relu(x):
    f = lambda o: (x[o] > 0) * x[o]
    dx = lambda i, o: (i == o) * (x[o] > 0)
    return f, dx

I = 10
O = 10
Is = np.arange(I)
Os = np.arange(O)[:, None]


def f(x: Shaped[Array, "I"]) -> Shaped[Array, "O"]:
    return x * (x > 0)


def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "O I"]:
    # x.shape (I, )
    # (Os == Is).shape (O, I)
    # Broadcasting (1, I) * (O, I)
    return (x > 0) * (Os == Is)
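
Because the function is element-wise, the same Jacobian can be built directly as a diagonal matrix with the indicator x > 0 on the diagonal. A quick check under the setup assumed above:

x = np.array([-2.0, -1.0, 0.5, 3.0, -0.1, 1.0, 2.0, -4.0, 0.25, 7.0])
print(np.allclose(jac(x), np.diag((x > 0).astype(float))))  # True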

Problem 8.5/9: Index

$f(x_0, x_1, \ldots, x_{24}) = [x_{10}, x_{11}, \ldots, x_{24}]$

Solution: The Jacobian is a $15 \times 25$ matrix. $x_0 \ldots x_9$ are not used in the output, so the derivative is 0 for those indices. The outputs are just the inputs shifted by 10, so the derivative is 1 for $i = o + 10$ and 0 otherwise.

# i o dx
# 0 0 0
# ...
# 10 0 1
# 11 1 1
# ...

def fb_index(x):
    f = lambda o: x[o + 10]
    dx = lambda i, o: 1 if i == (o + 10) else 0
    return f, dx

I = 25
O = 15
Is = np.arange(I)
Os = np.arange(O)[:, None]


def f(x: Shaped[Array, "I"]) -> Shaped[Array, "O"]:
    return x[10:]


def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "O I"]:
    return Is == (Os + 10)
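
The resulting $15 \times 25$ Jacobian is a $15 \times 15$ identity padded with ten columns of zeros on the left. A quick structural check under the setup assumed above:

x = np.arange(25.0)
expected = np.concatenate([np.zeros((15, 10)), np.eye(15)], axis=1)
print(np.allclose(jac(x), expected))  # True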

Problem 10: Cumsum

$f(x_0, x_1, \ldots) = [\sum^0_{i=0} x_{i}, \sum^1_{i=0} x_{i}, \sum^2_{i=0} x_{i}, \ldots] / 20$