My solutions to srush’s AutoDiff Puzzles. This is useful as a quick refresher for computing gradients.
Introduction
From the puzzle's intro:
Deep learning libraries like Torch utilize autodifferentiation of tensors to compute the parameter updates necessary to learn complex models from data. This technique is central to understanding why deep learning has become so widely used and effective. The autodifferentiation process is a neat trick that builds up a computational graph and then uses that graph to provide derivatives for user-constructed mathematical functions. At heart, this process is just an instantiation of the chain-rule based on implementations of every function and its derivative.
However, a library still needs to have efficient implementations of derivatives for its key building blocks. This sounds trivial: just implement high school calculus. However, this is a bit trickier than it sounds. Tensor-to-tensor functions are pretty complex and require keeping careful track of indexing on both the input and the output side.
Your goal in these puzzles is to implement the Jacobian, i.e. the derivative of each output cell with respect to each input cell, for each function below. In each case the function takes in a tensor $x$ and returns a tensor $f(x)$, so your job is to compute $\frac{\partial f(x)_j}{\partial x_i}$ for all indices $i$ and $j$. If you get discouraged, just remember, you did this in high school (it just had way less indexing).
Rules and Tips
- Every answer is 1 line of 80-column code.
- Everything in these puzzles should be done with standard Python numbers. (There is no need for Torch or tensors.)
- Recall the basic multivariate calculus identities, most importantly that $\frac{\partial x_j}{\partial x_i} = \mathbf{1}[i = j]$.
- Hint: Python booleans are numbers, so you can use them as indicator functions, i.e. (i == j) * g evaluates to g when i == j and to 0 otherwise.
For each of the problems, a tensor function is provided. There are two solution formats:
- Given a lambda called f, your job is to fill in the function dx(i, j), which provides $\frac{\partial f(x)_j}{\partial x_i}$. These puzzles are about correctness, not efficiency. (A finite-difference sanity check for this format is sketched right after this list.)
- Given some input and output tensors Is and Os and a function with Shaped Array types, fill out a function jac which returns the full Jacobian at a point (identical to jax.jacfwd(f)(x)).
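For the first format, a simple way to sanity-check a dx implementation is to compare it against a centered finite difference. The sketch below is my own helper, not part of the puzzle's test harness; check_dx, the explicit O argument (number of outputs), and the tolerances are made up for illustration.

import numpy as np

def check_dx(fb, x, O, eps=1e-5, tol=1e-4):
    # fb is one of the fb_* functions below, x is a 1-D float array,
    # and O is the number of outputs of f.
    _, dx = fb(x)
    for o in range(O):
        for i in range(x.shape[0]):
            xp, xm = x.copy(), x.copy()
            xp[i] += eps
            xm[i] -= eps
            # Centered finite difference of output o with respect to x[i].
            fd = (fb(xp)[0](o) - fb(xm)[0](o)) / (2 * eps)
            assert abs(dx(i, o) - fd) < tol, (i, o, dx(i, o), fd)

# e.g. check_dx(fb_mean, np.random.randn(10), O=1)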
Problem 1: Id
Warmup: $f(x) = x_0$.
Solution: This warmup asks us to compute $\frac{\partial f(x)_o}{\partial x_i}$. The derivative is 1 since it's just the identity function. More explicitly, we know that $\frac{\partial f(x)_o}{\partial x_i}$ is 1 for $i = o$ since $f(x)_o = x_o$.
def fb_id(x):
    f = lambda o: x[0]
    dx = lambda i, o: 1  # Fill in this line
    return f, dx
Is = np.arange(1)

def f(x: Shaped[Array, "1"]) -> Shaped[Array, "1"]:
    return 2 * x

def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "1 1"]:
    # f(x) = 2x, so the 1 x 1 Jacobian is [[2]]; (x == x) just materializes a 1.
    return 2 * (x == x)[None]
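The second format's stated criterion is agreement with jax.jacfwd at a point. Assuming jax is installed and the f/jac definitions above are in scope, a quick spot check for this warm-up could look like the following (the test point 3.0 is arbitrary):

import jax.numpy as jnp
from jax import jacfwd

x = jnp.array([3.0])
print(jac(x))        # [[2]]
print(jacfwd(f)(x))  # [[2.]]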
Problem 2: Cosine
Warmup: $f(x) = \cos(x_0)$.
Solution: The derivative of $\cos(x)$ is $-\sin(x)$. So we need to fill in the line dx = lambda i, o: -math.sin(x[0]).
import math

def fb_cos(x):
    f = lambda o: math.cos(x[0])
    dx = lambda i, o: -math.sin(x[0])  # Fill in this line
    return f, dx
def f(x: Shaped[Array, "1"]) -> Shaped[Array, "1"]:
    return np.cos(x)

def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "1 1"]:
    return -np.sin(x)[None]
Problem 3: Mean
Solution: The Jacobian has shape $1 \times I$ since the output only has one element. $\frac{\partial f(x)}{\partial x_i} = \frac{1}{I}$ for all $i$.
def fb_mean(x):
    I = x.shape[0]
    f = lambda o: sum(x[i] for i in range(I)) / I
    dx = lambda i, o: 1 / I  # Fill in this line
    return f, dx
I = 10
Is = np.arange(I)

def f(x: Shaped[Array, "I"]) -> Shaped[Array, "1"]:
    return np.mean(x, axis=0, keepdims=True)

def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "1 I"]:
    return 1 / x.shape[0] * (x == x)[None]
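The (x == x) expression above is just a trick for materializing a row of ones with the right shape; an equivalent way to write the same 1 x I Jacobian (my alternative phrasing, not the puzzle's required form, reusing the np import from above) fills the shape directly:

def jac_full(x):
    # Every entry of the 1 x I Jacobian of the mean is 1/I.
    return np.full((1, x.shape[0]), 1 / x.shape[0])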
Problem 4: Product
Solution: The Jacobian has shape $1 \times I$ since the output only has one element. For a given $i$, the derivative is the product of all the other elements. So we can write this as $\frac{\partial f(x)}{\partial x_i} = \prod_{j \neq i} x_j = \frac{\prod_j x_j}{x_i}$ (for $x_i \neq 0$).
def fb_prod(x):
    pr = torch.prod(x)
    f = lambda o: pr
    dx = lambda i, o: pr / x[i] if x[i] != 0 else 0
    return f, dx
def f(x: Shaped[Array, "I"]) -> Shaped[Array, "1"]:
    return np.prod(x, keepdims=True)

def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "1 I"]:
    pr = f(x)
    return (pr / x)[None]
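One caveat with the divide-by-x[i] trick in both versions: it assumes no input is exactly zero. If some x[i] = 0, the true derivative with respect to x[i] is the product of the remaining entries, which pr / x[i] cannot recover (and np.prod(x) / x turns into inf/nan). A zero-safe alternative, sketched here as a hypothetical jac_safe, builds each entry as an explicit leave-one-out product; it does O(I^2) work and is only meant as an illustration.

import numpy as np

def jac_safe(x):
    # Entry i of the 1 x I Jacobian is the product over all j != i.
    I = x.shape[0]
    return np.array([[np.prod(np.delete(x, i)) for i in range(I)]])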
Problem 5: Repeat
Hint: The function dx should return a scalar. It is the derivative of $f(x)_o$, i.e. the $o$'th output.
Solution: The Jacobian has shape $O \times 1$ since the output has $O$ elements. Every output is just a copy of $x_0$, so the derivative is 1 for all $o$.
def fb_repeat(x):
    f = lambda o: x[0]
    dx = lambda i, o: 1
    return f, dx
Is = np.arange(1)
O = 10
Os = np.arange(O)[:, None]

def f(x: Shaped[Array, "1"]) -> Shaped[Array, "O"]:
    return (x + (Os * 0 + 1))[:, 0]

def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "O 1"]:
    # (x == x)[None] has shape (1, 1); it broadcasts against the expected
    # (O, 1) Jacobian of all ones.
    return (x == x)[None]
Problem 6: Repeat and Scale
Solution: The scalar in front of the repeated version of the input depends on its index. In particular, the $o$-th element is $x_0 \cdot \frac{o}{I}$. So the derivative is $\frac{\partial f(x)_o}{\partial x_0} = \frac{o}{I}$.
def fb_repeat_scale(x):
    I = 50
    f = lambda o: x[0] * (o / I)
    dx = lambda i, o: (o / I)
    return f, dx
Is = np.arange(1)
O = 10
Os = np.arange(O)[:, None]

def f(x: Shaped[Array, "1"]) -> Shaped[Array, "O"]:
    return x * (Os / O)[:, 0]

def jac(x: Shaped[Array, "1"]) -> Shaped[Array, "O 1"]:
    return Os / O
Problem 7: Negation
(Hint: remember the indicator trick, i.e. (a == b) * 27 evaluates to 27 if a == b else 0.)
Solution: Here, the Jacobian has shape $O \times I$ with $O = I$ since the input and output both have $I$ elements. The Jacobian is a diagonal matrix with $-1$ on the diagonal.
def fb_neg(x):
    f = lambda o: -x[o]
    dx = lambda i, o: -(i == o)
    return f, dx
I = 10
O = 10
Is = np.arange(I)
Os = np.arange(O)[:, None]

def f(x: Shaped[Array, "I"]) -> Shaped[Array, "O"]:
    return -x

def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "O I"]:
    return (0 - (Os == Is[None])).astype(float)
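Since the answer is just a negated identity matrix, the same O x I Jacobian can also be written directly with np.eye (an equivalent alternative to the indicator expression, reusing the I and O globals from the cell above):

def jac_eye(x):
    # O x I matrix with -1 on the diagonal and 0 elsewhere.
    return -np.eye(O, I)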
Problem 8: ReLU
Recall $\mathrm{ReLU}(x) = \max(0, x)$.
(Note: you can ignore the issue of non-differentiability at 0.)
Solution: ReLU is an element-wise function, so we know the Jacobian is a diagonal matrix. The derivative is 0 if $x_i \leq 0$ and 1 otherwise.
def fb_relu(x):
    f = lambda o: (x[o] > 0) * x[o]
    dx = lambda i, o: (i == o) * (x[o] > 0)
    return f, dx
I = 10
O = 10
Is = np.arange(I)
Os = np.arange(O)[:, None]

def f(x: Shaped[Array, "I"]) -> Shaped[Array, "O"]:
    return x * (x > 0)

def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "O I"]:
    # x.shape:          (I,)
    # (Os == Is).shape: (O, I)
    # Broadcasting:     (1, I) * (O, I)
    return (x > 0) * (Os == Is)
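This observation generalizes: for any element-wise function, the Jacobian is a diagonal matrix whose diagonal is the element-wise derivative, so np.diag builds it directly. For ReLU that reads as follows (equivalent to the broadcasted indicator above, again reusing the surrounding cell's np import):

def jac_diag(x):
    # Diagonal Jacobian of an element-wise function; ReLU's element-wise
    # derivative is the indicator 1[x > 0].
    return np.diag((x > 0).astype(float))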
Problem 8.5/9: Index
Solution: The Jacobian is a $15 \times 25$ matrix. $x_0, \ldots, x_9$ are not used in the output, so the derivative is 0 for those indices. The outputs are just the inputs shifted by 10, so the derivative is 1 for $i = o + 10$ and 0 otherwise.
# i   o   dx
# 0   0   0
# ...
# 10  0   1
# 11  1   1
# ...
def fb_index(x):
    f = lambda o: x[o + 10]
    dx = lambda i, o: 1 if i == (o + 10) else 0
    return f, dx
I = 25
O = 15
Is = np.arange(I)
Os = np.arange(O)[:, None]

def f(x: Shaped[Array, "I"]) -> Shaped[Array, "O"]:
    return x[10:]

def jac(x: Shaped[Array, "I"]) -> Shaped[Array, "O I"]:
    return Is == (Os + 10)
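Equivalently, this Jacobian is a shifted identity matrix, which np.eye produces directly via its diagonal offset k (an alternative to the indicator form, reusing the I and O globals above):

def jac_shift(x):
    # 15 x 25 matrix with ones where i == o + 10, i.e. a 10-shifted identity.
    return np.eye(O, I, k=10)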