Over the last couple of months, I’ve found myself working out the gradient of the cosine loss twice. I seem to have lost the work the first time. This post makes sure I don’t have to work it all out a third time. Plus, if it turns out I’ve screwed this up, hopefully someone will point it out.
Why derive? Do you even autodiff?
If you’re using Tensorflow, PyTorch, or another library with tape-based auto-differentiation, all you have to implement is the loss, and the gradient will be calculated for you. spaCy uses its own library, Thinc, that works a bit differently. We use function composition to propagate gradients instead of tape-based auto-differentiation. This means instead of the loss, we need to provide the gradient of the loss. Besides, working out the gradient for yourself sometimes is healthy. Eat your vegetables, you know?
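To make that concrete, here’s a minimal sketch of the convention we’re working towards (illustrative only, not Thinc’s actual API): the loss function returns both the loss and a callback that computes the gradient of the loss with respect to the model’s guesses.

import numpy as np

# Minimal sketch of the "provide the gradient yourself" style. The function
# name and signature are made up for illustration; this is not Thinc's API.
def squared_error_loss(guesses, targets):
    diff = guesses - targets
    loss = (diff ** 2).sum()

    def backprop(d_loss=1.0):
        # d/d_guesses of sum((guesses - targets) ** 2) is 2 * (guesses - targets)
        return 2 * diff * d_loss

    return loss, backprop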
Solution 1: Copy and paste code from StackExchange
Whenever I google for a gradient calculation, I always end up at maths.stackexchange.com. I’m very grateful for these posts, but the truth is I seldom follow them very well. I’m just not practiced at manipulating equations in maths notation, like at all. I stopped taking maths classes at 16. Here’s what I come up with if I translate the final equation here into code.
def get_cossim_loss(yh, y):
    # Add a small constant to avoid 0 vectors
    yh = yh + 1e-8
    y = y + 1e-8
    # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity
    # get_array_module comes from Thinc: it returns numpy or cupy, depending
    # on whether the arrays live on CPU or GPU.
    xp = get_array_module(yh)
    norm_yh = xp.linalg.norm(yh, axis=1, keepdims=True)
    norm_y = xp.linalg.norm(y, axis=1, keepdims=True)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
    loss = xp.abs(1-cosine).sum()
    return loss, -d_yh
That looks plausible, but is it right? ¯\_(ツ)_/¯
I guess I could run some finite differences tests, but I’d rather be able to work these things out for myself. It’s not like the cosine is complicated — I should definitely be able to do this. So, let’s try. To keep things easy to check, here’s the same StackExchange translation again, simplified to work on a single pair of vectors with plain numpy:
import numpy as np
def sx_cosine_loss(yh, y):
    # Add a small constant to avoid 0 vectors
    yh = yh + 1e-8
    y = y + 1e-8
    # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity
    norm_yh = np.linalg.norm(yh)
    norm_y = np.linalg.norm(y)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum() / mul_norms
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
    loss = np.abs(1-cosine).sum()
    return loss, -d_yh
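For completeness, here’s roughly what that finite differences check would look like. The helper below is just a quick sketch I’m adding for illustration: it perturbs each element of yh and compares the numerical slope of the loss against the analytic gradient. If the StackExchange formula is right, the assertion should pass.

def finite_difference_grad(loss_fn, yh, y, eps=1e-6):
    # Central differences: nudge each element of yh up and down by eps
    grad = np.zeros_like(yh)
    for i in range(yh.shape[0]):
        plus, minus = yh.copy(), yh.copy()
        plus[i] += eps
        minus[i] -= eps
        loss_plus, _ = loss_fn(plus, y)
        loss_minus, _ = loss_fn(minus, y)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

check_yh = np.random.uniform(size=16)
check_y = np.random.uniform(size=16)
_, analytic_grad = sx_cosine_loss(check_yh, check_y)
numeric_grad = finite_difference_grad(sx_cosine_loss, check_yh, check_y)
np.testing.assert_almost_equal(analytic_grad, numeric_grad)

A check like that tells me the code is numerically consistent with the loss, but it doesn’t get me any closer to understanding the gradient, so let’s keep going.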
Solution 2: Tangent and refactor
Tangent is a neat library for source-to-source autodifferentiation. You implement a function, and it generates code to calculate the gradient with respect to the inputs. This is a fantastic fit for my requirements, as I don’t want a runtime dependency. I just want some tooling assistance to help me make sure I’m doing it right. First, a quick refresher. Here’s the cosine loss:
import numpy as np
def cosine_loss(X, Y):
    xnorm = np.sqrt(np.sum(X*X))
    ynorm = np.sqrt(np.sum(Y*Y))
    similarity = np.sum(X*Y) / (xnorm * ynorm)
    return 1 - similarity
The similarity calculation ranges from -1 to 1. We want the loss to be 0 when the similarity is 1, so the loss is 1 - similarity. Here’s what Tangent generates for the gradient calculation of this function:
import tangent
tangent_cosine_loss = tangent.grad(cosine_loss, verbose=True)
def dcosine_lossdX(X, Y, b_return=1.0):
    X_times_X = X * X
    _xnorm = np.sum(X_times_X)
    xnorm = np.sqrt(_xnorm)
    Y_times_Y = Y * Y
    _ynorm = np.sum(Y_times_Y)
    ynorm = np.sqrt(_ynorm)
    _similarity2 = xnorm * ynorm
    X_times_Y = X * Y
    _similarity = np.sum(X_times_Y)
    similarity = _similarity / _similarity2
    _return = 1 - similarity
    assert tangent.shapes_match(_return, b_return
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(_return), numpy.shape(b_return))

    # Grad of: _similarity = np.sum(X_times_Y)
    _bsimilarity = -tangent.unbroadcast(b_return, similarity)
    bsimilarity = _bsimilarity

    # Grad of: similarity = np.sum(X * Y) / (xnorm * ynorm)
    _b_similarity = bsimilarity / _similarity2
    _b_similarity2 = -bsimilarity * _similarity / (_similarity2 * _similarity2)
    b_similarity = _b_similarity
    b_similarity2 = _b_similarity2
    _bX_times_Y = tangent.astype(tangent.unreduce(b_similarity, numpy.shape(
        X_times_Y), None, False), X_times_Y)
    bX_times_Y = _bX_times_Y
    _bX3 = tangent.unbroadcast(bX_times_Y * Y, X)
    bX = _bX3
    _bxnorm = tangent.unbroadcast(b_similarity2 * ynorm, xnorm)
    bxnorm = _bxnorm

    # Grad of: xnorm = np.sqrt(np.sum(X * X))
    _xnorm2 = xnorm
    _b_xnorm = bxnorm / (2.0 * _xnorm2)
    b_xnorm = _b_xnorm
    _bX_times_X = tangent.astype(tangent.unreduce(b_xnorm, numpy.shape(
        X_times_X), None, False), X_times_X)
    bX_times_X = _bX_times_X
    _bX = tangent.unbroadcast(bX_times_X * X, X)
    _bX2 = tangent.unbroadcast(bX_times_X * X, X)
    bX = tangent.add_grad(bX, _bX)
    bX = tangent.add_grad(bX, _bX2)
    return bX
Code! How nice. One thing we can do right away is test the code we cut and paste from StackExchange:
vec1 = np.random.uniform(size=128)
vec2 = np.random.uniform(size=128)
sx_loss, sx_grad = sx_cosine_loss(vec1, vec2)
tangent_grad = tangent_cosine_loss(vec1, vec2)
import numpy.testing
numpy.testing.assert_almost_equal(tangent_grad, sx_grad)
Checks out! From a practical perspective, our work is done here. We can now be pretty confident that the code from StackExchange is correct, so if we just want to move forward with our experiments, we can go ahead and paste it in. This is still not very satisfying, though. I don’t feel like I’m much closer to working out the gradient for myself. So let’s dig a little deeper. The first step is to break things up into smaller functions, so that Tangent is easier to work with. We’ll start with a function for the norm calculation:
def L2_norm(X):
    XX = X*X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    return XX_sqrt
dL2_normdX = tangent.grad(L2_norm, verbose=True)
def dL2_normdX(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    assert tangent.shapes_match(XX_sqrt, bXX_sqrt
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(XX_sqrt), numpy.shape(bXX_sqrt))

    # Beginning of backward pass
    _XX_sqrt = XX_sqrt

    # Grad of: XX_sqrt = np.sqrt(XX_sum)
    _bXX_sum = bXX_sqrt / (2.0 * _XX_sqrt)
    bXX_sum = _bXX_sum

    # Grad of: XX_sum = np.sum(XX)
    _bXX = tangent.astype(tangent.unreduce(bXX_sum, numpy.shape(XX), None,
        False), XX)
    bXX = _bXX

    # Grad of: XX = X * X
    _bX = tangent.unbroadcast(bXX * X, X)
    _bX2 = tangent.unbroadcast(bXX * X, X)
    bX = _bX
    bX = tangent.add_grad(bX, _bX2)
    return bX
The first thing we want to do is make some cosmetic improvements to the generated code, to make it more readable. We’ll strip out the assertions and some of the type casting, and rename a few variables:
def dL2_normdX_v2(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    # Beginning of backward pass
    # Grad of: XX_sqrt = np.sqrt(XX_sum)
    bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
    # Grad of: XX_sum = np.sum(XX)
    bXX = tangent.unreduce(bXX_sum, numpy.shape(XX), None, False)
    # Grad of: XX = X * X
    bX = tangent.unbroadcast(bXX * X, X)
    bX2 = tangent.unbroadcast(bXX * X, X)
    return tangent.add_grad(bX, bX2)
numpy.testing.assert_almost_equal(dL2_normdX(vec1), dL2_normdX_v2(vec1))
Next we want to remove the ugly tangent.unreduce and tangent.unbroadcast calls. These just make the code more general to different input shapes, which we don’t need for now. If you have a reduction operation total = vector.sum(), then as you backprop you’ll end up with a value d_total, and want to use it to calculate d_vector. This will just be d_total broadcast to vector’s shape, i.e. np.full(vector.shape, d_total). This makes sense, right? If you needed the summation to come out lower, you should decrease the value for everything you’re summing up — and vice versa if you’re trying to make the summation come out higher. Instead of creating the d_vector array explicitly, we can just use numpy’s broadcasting rules. We’ll also make the obvious simplification of multiplying the gradient by 2, instead of calculating the same thing twice and adding the results together:
def dL2_normdX_v2(X, bXX_sqrt=1.0):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)
    # Beginning of backward pass
    bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
    return 2 * X * bXX_sum
numpy.testing.assert_almost_equal(dL2_normdX(vec1), dL2_normdX_v2(vec1))
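As a quick aside, we can also ask Tangent to confirm that reduction rule for us: the gradient of a bare sum should come back as an array of ones, one for each element we summed.

def total_sum(vector):
    return np.sum(vector)

d_total_sumdvector = tangent.grad(total_sum)
vector = np.random.uniform(size=5)
# Every element contributes equally to the sum, so the gradient is all ones
numpy.testing.assert_almost_equal(d_total_sumdvector(vector), np.ones_like(vector))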
One further transformation will make the function easier to work with. Instead of calculating the forward and backward passes in one function, let’s move the backward pass into a callback:
def L2_norm(X):
    XX = X * X
    XX_sum = np.sum(XX)
    XX_sqrt = np.sqrt(XX_sum)

    def get_dX(bXX_sqrt):
        bXX_sum = bXX_sqrt / (2.0 * XX_sqrt)
        return 2 * X * bXX_sum

    return XX_sqrt, get_dX
L2_vec1, get_d_vec1 = L2_norm(vec1)
d_vec1 = get_d_vec1(1.0)
numpy.testing.assert_almost_equal(dL2_normdX(vec1), d_vec1)
Using this callback-based approach makes it easy to write higher-order functions that remain differentiable, without having to repeat calculations. For instance, here’s a function that chains two such functions together into a feed-forward relationship, so the outputs of the first are fed into the second:
def chain(func1, func2):
    """Compose two functions func1 and func2, such that func2(func1(x))"""
    def forward(*inputs):
        # Each function returns a tuple of outputs and a backprop callback
        result1, get_d_inputs = func1(*inputs)
        outputs, get_d_result1 = func2(*result1)

        def backprop(*d_outputs):
            d_result1 = get_d_result1(*d_outputs)
            d_inputs = get_d_inputs(*d_result1)
            return d_inputs

        return outputs, backprop
    return forward
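To see how chain is meant to be used, here’s a toy pair of functions written in that style. They’re made up purely for illustration; the important detail is that each one returns its outputs as a tuple, which is what the *result1 unpacking above assumes.

def double(X):
    def backprop(d_out):
        return (2.0 * d_out,)
    return (2.0 * X,), backprop

def add_one(X):
    def backprop(d_out):
        return (d_out,)
    return (X + 1.0,), backprop

double_then_add_one = chain(double, add_one)
(output,), get_d_input = double_then_add_one(np.ones(3))
numpy.testing.assert_almost_equal(output, np.full(3, 3.0))   # 2 * 1 + 1
(d_input,) = get_d_input(np.ones(3))
numpy.testing.assert_almost_equal(d_input, np.full(3, 2.0))  # d(2x + 1)/dx = 2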
With that in mind, let’s do the rest of the work to finish the gradient of the cosine loss. First we’ll make a little function that isolates the gradient of the dot product. This means we only have to work out the gradient of the step 1 - (XY_dot / (X_norm * Y_norm)). We can use Tangent to make sure we don’t screw up even this small step:
def dot(X, Y):
    XY = X*Y
    XY_sum = XY.sum()

    def get_dX(dXY_sum):
        return Y * dXY_sum

    def get_dY(dXY_sum):
        return X * dXY_sum

    return XY_sum, (get_dX, get_dY)
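Just to be safe, we can check the dot gradients against Tangent as well. dot_forward below is a throwaway helper I’m adding so that tangent.grad has a plain function to differentiate; the gradient of the dot product with respect to X should simply be Y.

def dot_forward(X, Y):
    return np.sum(X * Y)

tangent_ddotdX = tangent.grad(dot_forward)
XY_sum, (get_dX, get_dY) = dot(vec1, vec2)
numpy.testing.assert_almost_equal(get_dX(1.0), tangent_ddotdX(vec1, vec2))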
def cosine_division(XY_dot, X_norm, Y_norm):
    similarity = XY_dot / (X_norm * Y_norm)
    loss = 1-similarity
    return loss
tangent_backprop_cosine_division = tangent.grad(cosine_division, verbose=True, wrt=(0, 1, 2))
def dcosine_divisiondXY_dotX_normY_norm(XY_dot, X_norm, Y_norm, bloss=1.0):
    _similarity = X_norm * Y_norm
    similarity = XY_dot / _similarity
    loss = 1 - similarity
    assert tangent.shapes_match(loss, bloss
        ), 'Shape mismatch between return value (%s) and seed derivative (%s)' % (
        numpy.shape(loss), numpy.shape(bloss))

    # Grad of: loss = 1 - similarity
    _bsimilarity = -tangent.unbroadcast(bloss, similarity)
    bsimilarity = _bsimilarity

    # Grad of: similarity = XY_dot / (X_norm * Y_norm)
    _bXY_dot = bsimilarity / _similarity
    _b_similarity = -bsimilarity * XY_dot / (_similarity * _similarity)
    bXY_dot = _bXY_dot
    b_similarity = _b_similarity
    _bX_norm = tangent.unbroadcast(b_similarity * Y_norm, X_norm)
    _bY_norm = tangent.unbroadcast(b_similarity * X_norm, Y_norm)
    bX_norm = _bX_norm
    bY_norm = _bY_norm
    return bXY_dot, bX_norm, bY_norm
Cleaning this up and moving it into our callback-based style, we get:
def tangent_backprop_cosine_division(XY_dot, X_norm, Y_norm, d_loss=1.0):
    norms = X_norm * Y_norm
    similarity = XY_dot / norms
    loss = 1 - similarity

    # Grad of: loss = 1 - similarity
    d_similarity = -d_loss
    # Grad of: similarity = XY_dot / (X_norm * Y_norm)
    d_XY_dot = d_similarity / norms
    d_norms = -d_similarity * XY_dot / (norms * norms)
    d_X_norm = d_norms * Y_norm
    d_Y_norm = d_norms * X_norm
    return d_XY_dot, d_X_norm, d_Y_norm
def cosine_division_v2(XY_dot, X_norm, Y_norm):
    norms = X_norm * Y_norm
    similarity = XY_dot / norms
    loss = 1 - similarity

    def get_d_XY_dot(d_loss):
        return -d_loss / norms

    def get_dX_norm(d_loss):
        d_norms = d_loss * XY_dot / (norms * norms)
        return d_norms * Y_norm

    def get_dY_norm(d_loss):
        d_norms = d_loss * XY_dot / (norms * norms)
        return d_norms * X_norm

    return loss, (get_d_XY_dot, get_dX_norm, get_dY_norm)
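We can also check these callbacks against the partial derivatives worked out by hand from loss = 1 - XY_dot / (X_norm * Y_norm): the gradient with respect to XY_dot is -1 / (X_norm * Y_norm), and the gradients with respect to the norms come out to similarity / X_norm and similarity / Y_norm. This is just an extra bit of reassurance.

XY_dot = np.sum(vec1 * vec2)
X_norm = np.linalg.norm(vec1)
Y_norm = np.linalg.norm(vec2)
loss, (get_d_XY_dot, get_dX_norm, get_dY_norm) = cosine_division_v2(XY_dot, X_norm, Y_norm)
similarity = XY_dot / (X_norm * Y_norm)
numpy.testing.assert_almost_equal(get_d_XY_dot(1.0), -1.0 / (X_norm * Y_norm))
numpy.testing.assert_almost_equal(get_dX_norm(1.0), similarity / X_norm)
numpy.testing.assert_almost_equal(get_dY_norm(1.0), similarity / Y_norm)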
Finally, let’s put it all together. I’ll ignore the calculations for dY, and just provide the gradient with respect to X.
def cosine_loss_v3(X, Y):
    XY_dot = (X*Y).sum()
    X_norm = np.sqrt((X*X).sum())
    Y_norm = np.sqrt((Y*Y).sum())
    norms = X_norm * Y_norm
    similarity = XY_dot / norms
    loss = 1 - similarity

    def get_dX(d_loss):
        d_XY_dot = -d_loss / norms                              # from get_d_XY_dot
        d_X_norm = d_loss * XY_dot / (norms * norms) * Y_norm   # from get_dX_norm
        # Backprop through the dot product and the L2 norm of X
        dX = d_XY_dot * Y + d_X_norm * (X / X_norm)
        # Written out: dX = -((Y / (X_norm*Y_norm)) - (similarity * (X / X_norm**2)))
        return dX

    return loss, get_dX
loss, get_d_vec1 = cosine_loss_v3(vec1, vec2)
my_grad = get_d_vec1(1.0)
numpy.testing.assert_almost_equal(tangent_grad, my_grad)