Over the last couple of months, I’ve found myself working out the gradient of the cosine loss twice. I seem to have lost the work the first time. This post makes sure I don’t have to work it all out a third time. Plus, if it turns out I’ve screwed this up, hopefully someone will point it out.
Why derive? Do you even autodiff?
If you’re using Tensorflow, PyTorch, or another library with tape-based auto-differentiation, all you have to implement is the loss, and the gradient will be calculated for you. spaCy uses its own library, Thinc, that works a bit differently. We use function composition to propagate gradients instead of tape-based auto-differentiation. This means instead of the loss, we need to provide the gradient of the loss. Besides, working out the gradient for yourself sometimes is healthy. Eat your vegetables, you know?
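To make that concrete, here’s a minimal sketch of the convention we’re working towards (illustrative only, not Thinc’s actual API): the loss function returns both the loss and a callback that computes the gradient of the loss with respect to the model’s guesses.

import numpy as np

# Minimal sketch of the "provide the gradient yourself" style. The function
# name and signature are made up for illustration; this is not Thinc's API.
def squared_error_loss(guesses, targets):
    diff = guesses - targets
    loss = (diff ** 2).sum()

    def backprop(d_loss=1.0):
        # d/d_guesses of sum((guesses - targets) ** 2) is 2 * (guesses - targets)
        return 2 * diff * d_loss

    return loss, backprop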
Solution 1: Copy and paste code from StackExchange
Whenever I google for a gradient calculation, I always end up at maths.stackexchange.com. I’m very grateful for these posts, but the truth is I seldom follow them very well. I’m just not practiced at manipulating equations in maths notation, like at all. I stopped taking maths classes at 16. Here’s what I come up with if I translate the final equation here into code.
def get_cossim_loss(yh, y):
    # Add a small constant to avoid 0 vectors
    yh = yh + 1e-8
    y = y + 1e-8
    # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity
    # get_array_module comes from Thinc: it returns numpy or cupy, depending
    # on whether the arrays live on CPU or GPU.
    xp = get_array_module(yh)
    norm_yh = xp.linalg.norm(yh, axis=1, keepdims=True)
    norm_y = xp.linalg.norm(y, axis=1, keepdims=True)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
    loss = xp.abs(1-cosine).sum()
    return loss, -d_yh
That looks plausible, but is it right? ¯\_(ツ)_/¯
I guess I could run some finite differences tests, but I’d rather be able to work these things out for myself. It’s not like the cosine is complicated — I should definitely be able to do this. So, let’s try. To keep things easy to check, here’s the same StackExchange translation again, simplified to work on a single pair of vectors with plain numpy:
import numpy as np
def sx_cosine_loss(yh, y):
    # Add a small constant to avoid 0 vectors
    yh = yh + 1e-8
    y = y + 1e-8
    # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity
    norm_yh = np.linalg.norm(yh)
    norm_y = np.linalg.norm(y)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum() / mul_norms
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
    loss = np.abs(1-cosine).sum()
    return loss, -d_yh
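For completeness, here’s roughly what that finite differences check would look like. The helper below is just a quick sketch I’m adding for illustration: it perturbs each element of yh and compares the numerical slope of the loss against the analytic gradient. If the StackExchange formula is right, the assertion should pass.

def finite_difference_grad(loss_fn, yh, y, eps=1e-6):
    # Central differences: nudge each element of yh up and down by eps
    grad = np.zeros_like(yh)
    for i in range(yh.shape[0]):
        plus, minus = yh.copy(), yh.copy()
        plus[i] += eps
        minus[i] -= eps
        loss_plus, _ = loss_fn(plus, y)
        loss_minus, _ = loss_fn(minus, y)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

check_yh = np.random.uniform(size=16)
check_y = np.random.uniform(size=16)
_, analytic_grad = sx_cosine_loss(check_yh, check_y)
numeric_grad = finite_difference_grad(sx_cosine_loss, check_yh, check_y)
np.testing.assert_almost_equal(analytic_grad, numeric_grad)

A check like that tells me the code is numerically consistent with the loss, but it doesn’t get me any closer to understanding the gradient, so let’s keep going.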
Solution 2: Tangent and refactor
Tangent is a neat library for source-to-source autodifferentiation. You implement a function, and it generates code to calculate the gradient with respect to the inputs. This is a fantastic fit for my requirements, as I don’t want a runtime dependency. I just want some tooling assistance to help me make sure I’m doing it right. First, a quick refresher. Here’s the cosine loss:
import numpy as np
def cosine_loss(X, Y):
    xnorm = np.sqrt(np.sum(X*X))
    ynorm = np.sqrt(np.sum(Y*Y))
    similarity = np.sum(X*Y) / (xnorm * ynorm)
    return 1 - similarity
The similarity calculation ranges from -1 to 1. We want the loss to be 0 when the similarity is 1, so the loss is 1 - similarity. Here’s what Tangent generates for the gradient calculation of this function:
import tangent
tangent_cosine_loss = tangent.grad(cosine_loss, verbose=True)
def dcosine_lossdX(X, Y, b_return=1.0):
    X_times_X = X * X
    _xnorm = np.sum(X_times_X)
    xnorm = np.sqrt(_xnorm)
    Y_times_Y = Y * Y
    _ynorm = np.sum(Y_times_Y)
    ynorm = np.sqrt(_ynorm)
    _similarity2 = xnorm * ynorm
    X_times_Y = X * Y
    _similarity = np.sum(X_times_Y)
    similarity = _similarity / _similarity2
    _return = 1 - similarity
    assert tangent.shapes_match(_return, b_return
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(_return), numpy.shape(b_return))

    # Grad of: _similarity = np.sum(X_times_Y)
    _bsimilarity = -tangent.unbroadcast(b_return, similarity)
    bsimilarity = _bsimilarity

    # Grad of: similarity = np.sum(X * Y) / (xnorm * ynorm)
    _b_similarity = bsimilarity / _similarity2
    _b_similarity2 = -bsimilarity * _similarity / (_similarity2 * _similarity2)
    b_similarity = _b_similarity
    b_similarity2 = _b_similarity2
    _bX_times_Y = tangent.astype(tangent.unreduce(b_similarity, numpy.shape(
        X_times_Y), None, False), X_times_Y)
    bX_times_Y = _bX_times_Y
    _bX3 = tangent.unbroadcast(bX_times_Y * Y, X)
    bX = _bX3
    _bxnorm = tangent.unbroadcast(b_similarity2 * ynorm, xnorm)
    bxnorm = _bxnorm

    # Grad of: xnorm = np.sqrt(np.sum(X * X))
    _xnorm2 = xnorm
    _b_xnorm = bxnorm / (2.0 * _xnorm2)
    b_xnorm = _b_xnorm
    _bX_times_X = tangent.astype(tangent.unreduce(b_xnorm, numpy.shape(
        X_times_X), None, False), X_times_X)
    bX_times_X = _bX_times_X
    _bX = tangent.unbroadcast(bX_times_X * X, X)
    _bX2 = tangent.unbroadcast(bX_times_X * X, X)
    bX = tangent.add_grad(bX, _bX)
    bX = tangent.add_grad(bX, _bX2)
    return bX
Code! How nice. One thing we can do right away is test the code we cut and paste from StackExchange:
vec1 = np.random.uniform(size=128)
vec2 = np.random.uniform(size=128)
sx_loss, sx_grad = sx_cosine_loss(vec1, vec2)
tangent_grad = tangent_cosine_loss(vec1, vec2)
import numpy.testing
numpy.testing.assert_almost_equal(tangent_grad, sx_grad)
Checks out! From a practical perspective, our work is done here. We can now be pretty confident that the code from StackExchange is correct, so if we just want to move forward with our experiments, we can go ahead and paste it in. This is still not very satisfying, though. I don’t feel like I’m much closer to working out the gradient for myself. So let’s dig a little deeper. The first step is to break things up into smaller functions, so that Tangent is easier to work with. We’ll start with a function for the norm calculation:
def L2_norm(X):
    XX = X*X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    return XX_sqrt
dL2_normdX = tangent.grad(L2_norm, verbose=True)
def dL2_normdX(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    assert tangent.shapes_match(XX_sqrt, bXX_sqrt
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(XX_sqrt), numpy.shape(bXX_sqrt))

    # Beginning of backward pass
    _XX_sqrt = XX_sqrt

    # Grad of: XX_sqrt = np.sqrt(XX_sum)
    _bXX_sum = bXX_sqrt / (2.0 * _XX_sqrt)
    bXX_sum = _bXX_sum

    # Grad of: XX_sum = np.sum(XX)
    _bXX = tangent.astype(tangent.unreduce(bXX_sum, numpy.shape(XX), None,
        False), XX)
    bXX = _bXX

    # Grad of: XX = X * X
    _bX = tangent.unbroadcast(bXX * X, X)
    _bX2 = tangent.unbroadcast(bXX * X, X)
    bX = _bX
    bX = tangent.add_grad(bX, _bX2)
    return bX
The first thing we want to do is make some cosmetic improvements to the generated code, to make it more readable. We’ll strip out the assertions and some of the type casting, and rename a few variables:
def dL2_normdX_v2(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    # Beginning of backward pass
    # Grad of: XX_sqrt = np.sqrt(XX_sum)
    bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
    # Grad of: XX_sum = np.sum(XX)
    bXX = tangent.unreduce(bXX_sum, numpy.shape(XX), None, False)
    # Grad of: XX = X * X
    bX = tangent.unbroadcast(bXX * X, X)
    bX2 = tangent.unbroadcast(bXX * X, X)
    return tangent.add_grad(bX, bX2)
numpy.testing.assert_almost_equal(dL2_normdX(vec1), dL2_normdX_v2(vec1))
Next we want to remove the ugly tangent.unreduce and tangent.unbroadcast calls. These just make the code more general to different input shapes, which we don’t need for now. If you have a reduction operation total = vector.sum(), then as you backprop you’ll end up with a value d_total, and want to use it to calculate d_vector. This will just be d_total broadcast to vector’s shape, i.e. np.full(vector.shape, d_total). This makes sense, right? If you needed the summation to come out lower, you should decrease the value for everything you’re summing up — and vice versa if you’re trying to make the summation come out higher. Instead of creating the d_vector array explicitly, we can just use numpy’s broadcasting rules. We’ll also make the obvious simplification of multiplying the gradient by 2, instead of calculating the same thing twice and adding the results together:
def dL2_normdX_v2(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    # Beginning of backward pass
    bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
    return 2 * X * bXX_sum
numpy.testing.assert_almost_equal(dL2_normdX(vec1), dL2_normdX_v2(vec1))
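As a quick aside, we can also ask Tangent to confirm that reduction rule for us: the gradient of a bare sum should come back as an array of ones, one for each element we summed.

def total_sum(vector):
    return np.sum(vector)

d_total_sumdvector = tangent.grad(total_sum)
vector = np.random.uniform(size=5)
# Every element contributes equally to the sum, so the gradient is all ones
numpy.testing.assert_almost_equal(d_total_sumdvector(vector), np.ones_like(vector))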
One further transformation will make the function easier to work with. Instead of calculating the forward and backward passes in one function, let’s move the backward pass into a callback:
def L2_norm(X):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)

    def get_dX(bXX_sqrt):
        bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
        return 2 * X * bXX_sum

    return XX_sqrt, get_dX
L2_vec1, get_d_vec1 = L2_norm(vec1)
d_vec1 = get_d_vec1(1.0)
numpy.testing.assert_almost_equal(dL2_normdX(vec1), d_vec1)
Using this callback-based approach makes it easy to write higher-order functions that remain differentiable, without having to repeat calculations. For instance, here’s a function that chains two such functions together into a feed-forward relationship, so the outputs of the first are fed into the second:
def chain(func1, func2):
    """Compose two functions func1 and func2, such that func2(func1(x))"""
    def forward(*inputs):
        # Each function returns a tuple of outputs and a backprop callback
        result1, get_d_inputs = func1(*inputs)
        outputs, get_d_result1 = func2(*result1)

        def backprop(*d_outputs):
            d_result1 = get_d_result1(*d_outputs)
            d_inputs = get_d_inputs(*d_result1)
            return d_inputs

        return outputs, backprop
    return forward
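To see how chain is meant to be used, here’s a toy pair of functions written in that style. They’re made up purely for illustration; the important detail is that each one returns its outputs as a tuple, which is what the *result1 unpacking above assumes.

def double(X):
    def backprop(d_out):
        return (2.0 * d_out,)
    return (2.0 * X,), backprop

def add_one(X):
    def backprop(d_out):
        return (d_out,)
    return (X + 1.0,), backprop

double_then_add_one = chain(double, add_one)
(output,), get_d_input = double_then_add_one(np.ones(3))
numpy.testing.assert_almost_equal(output, np.full(3, 3.0))   # 2 * 1 + 1
(d_input,) = get_d_input(np.ones(3))
numpy.testing.assert_almost_equal(d_input, np.full(3, 2.0))  # d(2x + 1)/dx = 2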
With that in mind, let’s do the rest of the work to finish the gradient of the cosine loss. First we’ll make a little function that isolates the gradient of the dot product. This means we only have to work out the gradient of the step 1 - (XY_dot / (X_norm * Y_norm)). We can use Tangent to make sure we don’t screw up even this small step:
def dot(X, Y):
    XY = X*Y
    XY_sum = XY.sum()

    def get_dX(dXY_sum):
        return Y * dXY_sum

    def get_dY(dXY_sum):
        return X * dXY_sum

    return XY_sum, (get_dX, get_dY)
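Just to be safe, we can check the dot gradients against Tangent as well. dot_forward below is a throwaway helper I’m adding so that tangent.grad has a plain function to differentiate; the gradient of the dot product with respect to X should simply be Y.

def dot_forward(X, Y):
    return np.sum(X * Y)

tangent_ddotdX = tangent.grad(dot_forward)
XY_sum, (get_dX, get_dY) = dot(vec1, vec2)
numpy.testing.assert_almost_equal(get_dX(1.0), tangent_ddotdX(vec1, vec2))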
def cosine_division(XY_dot, X_norm, Y_norm):
    similarity = XY_dot / (X_norm * Y_norm)
    loss = 1-similarity
    return loss
tangent_backprop_cosine_division = tangent.grad(cosine_division, verbose=True, wrt=(0, 1, 2))
def dcosine_divisiondXY_dotX_normY_norm(XY_dot, X_norm, Y_norm, bloss=1.0):
    _similarity = X_norm * Y_norm
    similarity = XY_dot / _similarity
    loss = 1 - similarity
    assert tangent.shapes_match(loss, bloss
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(loss), numpy.shape(bloss))

    # Grad of: loss = 1 - similarity
    _bsimilarity = -tangent.unbroadcast(bloss, similarity)
    bsimilarity = _bsimilarity

    # Grad of: similarity = XY_dot / (X_norm * Y_norm)
    _bXY_dot = bsimilarity / _similarity
    _b_similarity = -bsimilarity * XY_dot / (_similarity * _similarity)
    bXY_dot = _bXY_dot
    b_similarity = _b_similarity
    _bX_norm = tangent.unbroadcast(b_similarity * Y_norm, X_norm)
    _bY_norm = tangent.unbroadcast(b_similarity * X_norm, Y_norm)
    bX_norm = _bX_norm
    bY_norm = _bY_norm
    return bXY_dot, bX_norm, bY_norm
Cleaning this up and moving it into our callback-based style, we get:
def tangent_backprop_cosine_division(XY_dot, X_norm, Y_norm, d_loss=1.0):
    norms = X_norm * Y_norm
    similarity = XY_dot / norms
    loss = 1 - similarity

    # Grad of: loss = 1 - similarity
    d_similarity = -d_loss
    # Grad of: similarity = XY_dot / (X_norm * Y_norm)
    d_XY_dot = d_similarity / norms
    d_norms = -d_similarity * XY_dot / (norms * norms)
    d_X_norm = d_norms * Y_norm
    d_Y_norm = d_norms * X_norm
    return d_XY_dot, d_X_norm, d_Y_norm
def cosine_division_v2(XY_dot, X_norm, Y_norm):
    norms = X_norm * Y_norm
    similarity = XY_dot / norms
    loss = 1 - similarity

    def get_d_XY_dot(d_loss):
        return -d_loss / norms

    def get_dX_norm(d_loss):
        d_norms = d_loss * XY_dot / (norms * norms)
        return d_norms * Y_norm

    def get_dY_norm(d_loss):
        d_norms = d_loss * XY_dot / (norms * norms)
        return d_norms * X_norm

    return loss, (get_d_XY_dot, get_dX_norm, get_dY_norm)
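We can also check these callbacks against the partial derivatives worked out by hand from loss = 1 - XY_dot / (X_norm * Y_norm): the gradient with respect to XY_dot is -1 / (X_norm * Y_norm), and the gradients with respect to the norms come out to similarity / X_norm and similarity / Y_norm. This is just an extra bit of reassurance.

XY_dot = np.sum(vec1 * vec2)
X_norm = np.linalg.norm(vec1)
Y_norm = np.linalg.norm(vec2)
loss, (get_d_XY_dot, get_dX_norm, get_dY_norm) = cosine_division_v2(XY_dot, X_norm, Y_norm)
similarity = XY_dot / (X_norm * Y_norm)
numpy.testing.assert_almost_equal(get_d_XY_dot(1.0), -1.0 / (X_norm * Y_norm))
numpy.testing.assert_almost_equal(get_dX_norm(1.0), similarity / X_norm)
numpy.testing.assert_almost_equal(get_dY_norm(1.0), similarity / Y_norm)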
Finally, let’s put it all together. I’ll ignore the calculations for dY, and just provide the gradient with respect to X.
def cosine_loss_v3(X, Y):
    XY_dot = (X*Y).sum()
    X_norm = np.sqrt((X*X).sum())
    Y_norm = np.sqrt((Y*Y).sum())
    norms = X_norm * Y_norm
    similarity = XY_dot / norms
    loss = 1 - similarity

    def get_dX(d_loss):
        d_XY_dot = -d_loss / norms                              # from get_d_XY_dot
        d_X_norm = d_loss * XY_dot / (norms * norms) * Y_norm   # from get_dX_norm
        # Backprop through the dot product and the L2 norm of X
        dX = d_XY_dot * Y + d_X_norm * (X / X_norm)
        # Written out: dX = -((Y / (X_norm*Y_norm)) - (similarity * (X / X_norm**2)))
        return dX

    return loss, get_dX
loss, get_d_vec1 = cosine_loss_v3(vec1, vec2)
my_grad = get_d_vec1(1.0)
numpy.testing.assert_almost_equal(tangent_grad, my_grad)