
Deriving the gradient of the cosine loss with Tangent and function composition

2019-03-20 · 10 minute read

Over the last couple of months, I’ve found myself working out the gradient of the cosine loss twice. I seem to have lost the work the first time. This post makes sure I don’t have to work it all out a third time. Plus, if it turns out I’ve screwed this up, hopefully someone will point it out.

Why derive? Do you even autodiff?

If you’re using TensorFlow, PyTorch, or another library with tape-based auto-differentiation, all you have to implement is the loss, and the gradient will be calculated for you. spaCy uses its own library, Thinc, which works a bit differently. We use function composition to propagate gradients instead of tape-based auto-differentiation. This means that instead of the loss, we need to provide the gradient of the loss. Besides, working the gradient out for yourself sometimes is healthy. Eat your vegetables, you know?
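
For contrast, here's roughly what the tape-based workflow looks like. This is my own sketch in PyTorch (not spaCy or Thinc code): you only write the forward loss, and backward() fills in the gradient.

import torch
import torch.nn.functional as F

yh = torch.rand(128, requires_grad=True)   # prediction
y = torch.rand(128)                        # target
loss = 1 - F.cosine_similarity(yh, y, dim=0)
loss.backward()                            # the tape computes d_loss / d_yh
print(yh.grad.shape)                       # torch.Size([128])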

Solution 1: Copy and paste code from StackExchange

Whenever I google for a gradient calculation, I always end up at maths.stackexchange.com. I’m very grateful for these posts, but the truth is I seldom follow them very well. I’m just not practiced at manipulating equations in maths notation, like at all. I stopped taking maths classes at 16. Here’s what I come up with if I translate the final equation in that thread into code.

def get_cossim_loss(yh, y):
    # Add a small constant to avoid 0 vectors
    yh = yh + 1e-8
    y = y + 1e-8
    # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity
    xp = get_array_module(yh)
    norm_yh = xp.linalg.norm(yh, axis=1, keepdims=True)
    norm_y = xp.linalg.norm(y, axis=1, keepdims=True)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
    loss = xp.abs(1-cosine).sum()
    return loss, -d_yh

That looks plausible, but is it right? ¯\_(ツ)_/¯

I guess I could run some finite-difference tests, but I’d rather be able to work these things out for myself. It’s not like the cosine is complicated; I should definitely be able to do this. So, let’s try. To keep things simple, here’s a single-vector version of the StackExchange code, which we’ll test against later (a sketch of the finite-difference check follows it):

import numpy as np
def sx_cosine_loss(yh, y):
    # Add a small constant to avoid 0 vectors
    yh = yh + 1e-8
    y = y + 1e-8
    # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity
    norm_yh = np.linalg.norm(yh)
    norm_y = np.linalg.norm(y)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum() / mul_norms
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
    loss = np.abs(1-cosine).sum()
    return loss, -d_yh
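
Here’s the kind of finite-difference check I mean. This is my own sketch (the helper and variable names are mine, not from the post): nudge each element of yh up and down a little, and compare the change in the loss against the analytic gradient.

def finite_difference_grad(loss_fn, yh, y, eps=1e-6):
    # Estimate d_loss / d_yh numerically, one element at a time.
    grad = np.zeros_like(yh)
    for i in range(yh.shape[0]):
        upper = yh.copy()
        lower = yh.copy()
        upper[i] += eps
        lower[i] -= eps
        loss_up, _ = loss_fn(upper, y)
        loss_down, _ = loss_fn(lower, y)
        grad[i] = (loss_up - loss_down) / (2 * eps)
    return grad

a = np.random.uniform(size=128)
b = np.random.uniform(size=128)
_, analytic = sx_cosine_loss(a, b)
numeric = finite_difference_grad(sx_cosine_loss, a, b)
np.testing.assert_almost_equal(numeric, analytic, decimal=6)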

Solution 2: Tangent and refactor

Tangent is a neat library for source-to-source autodifferentiation. You implement a function, and it generates code to calculate the gradient with respect to the inputs. This is a fantastic fit for my requirements, as I don’t want a runtime dependency. I just want some tooling assistance to help me make sure I’m doing it right. First, a quick refresher. Here’s the cosine loss:

import numpy as np
def cosine_loss(X, Y):
    xnorm = np.sqrt(np.sum(X*X))
    ynorm = np.sqrt(np.sum(Y*Y))
    similarity = np.sum(X*Y) / (xnorm * ynorm)
    return 1 - similarity
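
In equation form (just restating the code above), that’s:

$$L(X, Y) = 1 - \frac{X \cdot Y}{\lVert X\rVert\,\lVert Y\rVert}$$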

The similarity calculation ranges from -1 to 1. We want the loss to be 0 when the similarity is 1, so the loss is 1-similarity. Here’s what Tangent generates for the gradient calculation of this function:

import tangent
tangent_cosine_loss = tangent.grad(cosine_loss, verbose=True)
def dcosine_lossdX(X, Y, b_return=1.0):
    X_times_X = X * X
    _xnorm = np.sum(X_times_X)
    xnorm = np.sqrt(_xnorm)
    Y_times_Y = Y * Y
    _ynorm = np.sum(Y_times_Y)
    ynorm = np.sqrt(_ynorm)
    _similarity2 = xnorm * ynorm
    X_times_Y = X * Y
    _similarity = np.sum(X_times_Y)
    similarity = _similarity / _similarity2
    _return = 1 - similarity
    assert tangent.shapes_match(_return, b_return
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(_return), numpy.shape(b_return))
    # Grad of: _similarity = np.sum(X_times_Y)
    _bsimilarity = -tangent.unbroadcast(b_return, similarity)
    bsimilarity = _bsimilarity
    # Grad of: similarity = np.sum(X * Y) / (xnorm * ynorm)
    _b_similarity = bsimilarity / _similarity2
    _b_similarity2 = -bsimilarity * _similarity / (_similarity2 * _similarity2)
    b_similarity = _b_similarity
    b_similarity2 = _b_similarity2
    _bX_times_Y = tangent.astype(tangent.unreduce(b_similarity, numpy.shape(
        X_times_Y), None, False), X_times_Y)
    bX_times_Y = _bX_times_Y
    _bX3 = tangent.unbroadcast(bX_times_Y * Y, X)
    bX = _bX3
    _bxnorm = tangent.unbroadcast(b_similarity2 * ynorm, xnorm)
    bxnorm = _bxnorm
    # Grad of: xnorm = np.sqrt(np.sum(X * X))
    _xnorm2 = xnorm
    _b_xnorm = bxnorm / (2.0 * _xnorm2)
    b_xnorm = _b_xnorm
    _bX_times_X = tangent.astype(tangent.unreduce(b_xnorm, numpy.shape(
        X_times_X), None, False), X_times_X)
    bX_times_X = _bX_times_X
    _bX = tangent.unbroadcast(bX_times_X * X, X)
    _bX2 = tangent.unbroadcast(bX_times_X * X, X)
    bX = tangent.add_grad(bX, _bX)
    bX = tangent.add_grad(bX, _bX2)
    return bX

Code! How nice. One thing we can do right away is test the code we cut and paste from StackExchange:

vec1 = np.random.uniform(size=128)
vec2 = np.random.uniform(size=128)
sx_loss, sx_grad = sx_cosine_loss(vec1, vec2)
tangent_grad = tangent_cosine_loss(vec1, vec2)
import numpy.testing
numpy.testing.assert_almost_equal(tangent_grad, sx_grad)

Checks out! From a practical perspective, our work is done here. We can now be pretty confident that the code from StackExchange is correct, so if we just want to move forward with our experiments, we can go ahead and paste it in. This is still not very satisfying, though. I don’t feel like I’m much closer to working out the gradient for myself. So let’s dig a little deeper. The first step is to break things up into smaller functions, so that Tangent is easier to work with. We’ll start with a function for the norm calculation:

def L2_norm(X):
    XX = X*X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    return XX_sqrt
dL2_normdX = tangent.grad(L2_norm, verbose=True)
def dL2_normdX(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    assert tangent.shapes_match(XX_sqrt, bXX_sqrt
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(XX_sqrt), numpy.shape(bXX_sqrt))
    # Beginning of backward pass
    _XX_sqrt = XX_sqrt
    # Grad of: XX_sqrt = np.sqrt(XX_sum)
    _bXX_sum = bXX_sqrt / (2.0 * _XX_sqrt)
    bXX_sum = _bXX_sum
    # Grad of: XX_sum = np.sum(XX)
    _bXX = tangent.astype(tangent.unreduce(bXX_sum, numpy.shape(XX), None,
        False), XX)
    bXX = _bXX
    # Grad of: XX = X * X
    _bX = tangent.unbroadcast(bXX * X, X)
    _bX2 = tangent.unbroadcast(bXX * X, X)
    bX = _bX
    bX = tangent.add_grad(bX, _bX2)
    return bX

The first thing we want to do is make some cosmetic improvements to the generated code, to make it more readable. We’ll strip out the assertions, some of the type casting, and make some renames:

def dL2_normdX_v2(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    # Beginning of backward pass
    # Grad of: XX_sqrt = np.sqrt(XX_sum)
    bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
    # Grad of: XX_sum = np.sum(XX)
    bXX = tangent.unreduce(bXX_sum, numpy.shape(XX), None, False)
    # Grad of: XX = X * X
    bX = tangent.unbroadcast(bXX * X, X)
    bX2 = tangent.unbroadcast(bXX * X, X)
    return tangent.add_grad(bX, bX2)
numpy.testing.assert_almost_equal(dL2_normdX(vec1), dL2_normdX_v2(vec1))

Next we want to remove the ugly tangent.unreduce and tangent.unbroadcast calls. These just make the code more general to different input shapes, which we don’t need for now. If you have a reduction operation total = vector.sum(), as you backprop you’ll end up with a value d_total, and want to use it to calculate d_vector. This will just be d_total broadcast to the shape of vector, i.e. np.full(vector.shape, d_total). This makes sense, right? If you needed the summation to come out lower, you should decrease the value of everything you’re summing up, and vice versa if you’re trying to make the summation come out higher. Instead of creating the d_vector array explicitly, we can just rely on numpy’s broadcasting rules (there’s a tiny illustration after the next code block). We’ll also make the obvious simplification of multiplying the gradient by 2, instead of calculating the same thing twice and adding the results together:

def dL2_normdX_v2(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    # Beginning of backward pass
    bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
    return 2 * X * bXX_sum
numpy.testing.assert_almost_equal(dL2_normdX(vec1), dL2_normdX_v2(vec1))
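
As a tiny illustration of the broadcasting point above (my own example, not from the post): multiplying by the scalar bXX_sum is the same as building the d-array explicitly with np.full.

bXX_sum = 0.5
explicit = 2 * vec1 * np.full(vec1.shape, bXX_sum)
broadcast = 2 * vec1 * bXX_sum
numpy.testing.assert_almost_equal(explicit, broadcast)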

One further transformation will make the function easier to work with. Instead of calculating the forward and backward passes in one function, let’s move the backward pass into a callback:

def L2_norm(X):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    def get_dX(bXX_sqrt):
        bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
        return 2 * X * bXX_sum
    return XX_sqrt, get_dX
L2_vec1, get_d_vec1 = L2_norm(vec1)
d_vec1 = get_d_vec1(1.0)
numpy.testing.assert_almost_equal(dL2_normdX(vec1), d_vec1)

Using this callback-based approach makes it easy to write higher-order functions that remain differentiable, without having to repeat calculations. For instance, here’s a function that chains two such functions into a feed-forward relationship:

def chain(func1, func2):
    """Compose two functions f and g, such that g(f(x))"""
    def forward(*inputs):
        result1, get_d_inputs = func1(*inputs)
        outputs, get_d_result1 = func2(*result1)
        def backprop(*d_outputs):
            d_result1 = get_d_result1(*d_outputs)
            d_inputs = get_d_inputs(*d_result1)
            return d_inputs
        return outputs, backprop
    return forward
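
Here’s a usage sketch with two toy layers of my own (not from the post). Both follow the (outputs, backprop) convention, with the outputs wrapped in a tuple so chain can unpack them:

def double(X):
    def backprop(d_out):
        return (2.0 * d_out,)
    return (2.0 * X,), backprop
def sum_all(X):
    def backprop(d_total):
        # Gradient of a sum: broadcast d_total back over X
        return (np.ones_like(X) * d_total,)
    return np.sum(X), backprop
double_then_sum = chain(double, sum_all)
total, backprop = double_then_sum(vec1)
d_vec1, = backprop(1.0)
numpy.testing.assert_almost_equal(d_vec1, 2 * np.ones_like(vec1))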

With that in mind, let’s do the rest of the work to finish the gradient of the cosine loss. First we’ll make a little function that isolates the gradient of the dot product. This means we only have to work out the gradient of the step 1 - (XY_dot / (X_norm * Y_norm)). We can use Tangent to make sure we don’t screw up even this small step:

def dot(X, Y):
    XY = X*Y
    XY_sum = XY.sum()
    def get_dX(dXY_sum):
        return Y * dXY_sum
    def get_dY(dXY_sum):
        return X * dXY_sum
    return XY_sum, (get_dX, get_dY)
def cosine_division(XY_dot, X_norm, Y_norm):
    similarity = XY_dot / (X_norm * Y_norm)
    loss = 1 - similarity
    return loss
tangent_backprop_cosine_division = tangent.grad(cosine_division, verbose=True, wrt=(0, 1, 2))
def dcosine_divisiondXY_dotX_normY_norm(XY_dot, X_norm, Y_norm, bloss=1.0):
    _similarity = X_norm * Y_norm
    similarity = XY_dot / _similarity
    loss = 1 - similarity
    assert tangent.shapes_match(loss, bloss
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(loss), numpy.shape(bloss))
    # Grad of: loss = 1 - similarity
    _bsimilarity = -tangent.unbroadcast(bloss, similarity)
    bsimilarity = _bsimilarity
    # Grad of: similarity = XY_dot / (X_norm * Y_norm)
    _bXY_dot = bsimilarity / _similarity
    _b_similarity = -bsimilarity * XY_dot / (_similarity * _similarity)
    bXY_dot = _bXY_dot
    b_similarity = _b_similarity
    _bX_norm = tangent.unbroadcast(b_similarity * Y_norm, X_norm)
    _bY_norm = tangent.unbroadcast(b_similarity * X_norm, Y_norm)
    bX_norm = _bX_norm
    bY_norm = _bY_norm
    return bXY_dot, bX_norm, bY_norm

Cleaning this up and moving it into our callback-based style, we get:

def tangent_backprop_cosine_division(XY_dot, X_norm, Y_norm, d_loss=1.0):
    norm_prod = X_norm * Y_norm
    similarity = XY_dot / norm_prod
    loss = 1 - similarity
    # Grad of: loss = 1 - similarity
    d_similarity = -d_loss
    # Grad of: similarity = XY_dot / (X_norm * Y_norm)
    d_XY_dot = d_similarity / norm_prod
    d_norm_prod = -d_similarity * XY_dot / (norm_prod * norm_prod)
    d_X_norm = d_norm_prod * Y_norm
    d_Y_norm = d_norm_prod * X_norm
    return d_XY_dot, d_X_norm, d_Y_norm
def cosine_division_v2(XY_dot, X_norm, Y_norm):
    norm_prod = X_norm * Y_norm
    similarity = XY_dot / norm_prod
    loss = 1 - similarity
    def get_d_XY_dot(d_loss):
        return -d_loss / norm_prod
    def get_d_X_norm(d_loss):
        d_norm_prod = d_loss * XY_dot / (norm_prod * norm_prod)
        return d_norm_prod * Y_norm
    def get_d_Y_norm(d_loss):
        d_norm_prod = d_loss * XY_dot / (norm_prod * norm_prod)
        return d_norm_prod * X_norm
    return loss, (get_d_XY_dot, get_d_X_norm, get_d_Y_norm)
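
A quick consistency check (my own, reusing vec1 and vec2 from earlier): the callbacks should agree with the cleaned-up flat version.

XY_dot = (vec1 * vec2).sum()
X_norm = np.linalg.norm(vec1)
Y_norm = np.linalg.norm(vec2)
loss, callbacks = cosine_division_v2(XY_dot, X_norm, Y_norm)
get_d_XY_dot, get_d_X_norm, get_d_Y_norm = callbacks
d_XY_dot, d_X_norm, d_Y_norm = tangent_backprop_cosine_division(XY_dot, X_norm, Y_norm)
numpy.testing.assert_almost_equal(get_d_XY_dot(1.0), d_XY_dot)
numpy.testing.assert_almost_equal(get_d_X_norm(1.0), d_X_norm)
numpy.testing.assert_almost_equal(get_d_Y_norm(1.0), d_Y_norm)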

Finally, let’s put it all together. I’ll ignore the calculations for dY, and just provide the gradient with respect to X.

def cosine_loss_v3(X, Y):
    XY_dot = (X*Y).sum()
    X_norm = np.sqrt((X*X).sum())
    Y_norm = np.sqrt((Y*Y).sum())
    norm_prod = X_norm * Y_norm
    similarity = XY_dot / norm_prod
    loss = 1 - similarity
    def get_dX(d_loss):
        # Backprop through loss = 1 - XY_dot / (X_norm * Y_norm)
        d_XY_dot = -d_loss / norm_prod
        d_norm_prod = d_loss * XY_dot / (norm_prod * norm_prod)
        d_X_norm = d_norm_prod * Y_norm
        # Backprop through XY_dot = (X*Y).sum() and X_norm = np.sqrt((X*X).sum())
        dX = d_XY_dot * Y + d_X_norm * (X / X_norm)
        # Same as the StackExchange formula, negated for the loss:
        # dX = -(Y / (X_norm*Y_norm)) + (similarity * (X / X_norm**2))
        return dX
    return loss, get_dX
loss, get_d_vec1 = cosine_loss_v3(vec1, vec2)
my_grad = get_d_vec1(1.0)
numpy.testing.assert_almost_equal(tangent_grad, my_grad)
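
For the record, here’s the gradient we’ve arrived at, written out as an equation (the StackExchange formula again, negated because we differentiate the loss rather than the similarity):

$$\frac{\partial L}{\partial X} = -\frac{Y}{\lVert X\rVert\,\lVert Y\rVert} + \frac{X \cdot Y}{\lVert X\rVert\,\lVert Y\rVert}\cdot\frac{X}{\lVert X\rVert^{2}}, \qquad \text{where } L = 1 - \frac{X \cdot Y}{\lVert X\rVert\,\lVert Y\rVert}.$$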