Gradient of primitive and confusion #19514
ericmjonas asked this question in General

Hello! I'm filing this as a "discussion" because I'm sure it's an error in my understanding and not actually a bug. I'm trying to fit a function's gradient with machine learning, where the function contains calls to a custom primitive p(x) which is backed by a pile of C++/CUDA.

In pseudocode I have:
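A minimal sketch of that setup, assuming names like my_fwd_p, my_bwd_p, and my_func, and using sin/cos as stand-ins for the C++/CUDA kernels (this is a reconstruction from the description above, not the author's code):

```python
import jax
import jax.numpy as jnp
from jax import core  # newer JAX versions expose Primitive under jax.extend.core

# Two raw primitives standing in for the C++/CUDA-backed fwd and bwd kernels.
my_fwd_p = core.Primitive("my_fwd_p")
my_fwd_p.def_impl(lambda x: jnp.sin(x))   # stand-in for the CUDA forward kernel
my_fwd_p.def_abstract_eval(lambda x: core.ShapedArray(x.shape, x.dtype))

my_bwd_p = core.Primitive("my_bwd_p")
my_bwd_p.def_impl(lambda x: jnp.cos(x))   # stand-in for the CUDA backward kernel
my_bwd_p.def_abstract_eval(lambda x: core.ShapedArray(x.shape, x.dtype))

# my_func only ever exposes a first derivative, wired up with defvjp.
@jax.custom_vjp
def my_func(x):
    return my_fwd_p.bind(x)

def my_func_fwd(x):
    return my_fwd_p.bind(x), x            # keep x as the residual

def my_func_bwd(x, ct):
    return (ct * my_bwd_p.bind(x),)       # VJP: cotangent * d my_func / dx

my_func.defvjp(my_func_fwd, my_func_bwd)

# Training then takes jax.grad of a loss built from gradients that flow
# through my_func (fitting a model to the function's gradient).
```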
Technically I have two primitives, a fwd and a bwd, that I then set up with defvjp. As far as I can tell, this should only ever require the VJP of my_func. However, I am getting the error:

Differentiation rule for 'my_bwd_p' not implemented

which is confusing me, as I really don't think we should need anything beyond first derivatives for my_func.
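Continuing the sketch above (an illustration, not taken from the author's notebook): a single reverse-mode derivative only needs the registered VJP, but nesting differentiation through my_func asks JAX to differentiate the raw primitives that appear inside the fwd/bwd rules, which produces an error of exactly this shape:

```python
x = 0.7

print(jax.grad(my_func)(x))        # works: only the registered VJP is needed

jax.grad(jax.grad(my_func))(x)     # NotImplementedError: Differentiation rule
                                   # for '...' not implemented, naming whichever
                                   # raw primitive JAX reaches first
```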
I have constructed a full example for my own primitive that just implements sin, below. I've tried everything I can think of, including liberal application of jax.lax.stop_gradient. The real code I care about for my_func's fwd and bwd primitives is incredibly complicated, and the idea of implementing a higher-order gradient (that is, the derivative rule for bwd) is sort of soul-crushing. But I don't think it should be necessary?

I'm attaching a full example of what I'm running into below, and it is also available as a Colab notebook here: https://colab.research.google.com/drive/1T0HcQlELiUJVw6OhptNox6Ed_LsaCYOm?usp=sharing
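For context on what "the derivative rule for bwd" would entail, here is a hedged sketch for the sin-backed toy above: registering a JVP for my_bwd_p, which needs the second derivative (-sin), i.e. exactly the extra kernel the post is trying to avoid. This is an illustration of the shape of the work, not the author's code or a recommendation:

```python
from jax.interpreters import ad

def my_bwd_jvp(primals, tangents):
    # Sketch only: symbolic-zero tangent handling is omitted.
    (x,), (x_dot,) = primals, tangents
    primal_out = my_bwd_p.bind(x)          # cos(x)
    tangent_out = -jnp.sin(x) * x_dot      # d/dx cos(x) = -sin(x), i.e. a
                                           # second-derivative kernel
    return primal_out, tangent_out

ad.primitive_jvps[my_bwd_p] = my_bwd_jvp   # my_fwd_p typically needs a rule
                                           # too before nested grads go through
```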
Reply (1 comment, 4 replies):

Can I ask why you're defining primitives at all? A big reason that …