
Replace @adjoint with rrule #1863

Merged · 4 commits into FluxML:master from the chainrules branch · Feb 24, 2022
Conversation

mcabbott (Member) commented Feb 5, 2022:

To allow use without Zygote, we should move to defining rules via ChainRules.

Most of these are mechanical, but perhaps deserve a quick look to see if there are tests. Comments on particular ones below.
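For readers less familiar with the two systems, a minimal sketch of the kind of mechanical translation involved (double is a hypothetical stand-in, not a Flux function):

using ChainRulesCore

# A toy function standing in for the functions whose rules are moved.
double(x) = 2x

# Zygote style, the form this PR removes:
#   @adjoint double(x) = double(x), Δ -> (2Δ,)

# ChainRules style, the form this PR adds:
function ChainRulesCore.rrule(::typeof(double), x)
    double_pullback(Δ) = (NoTangent(), 2Δ)
    return double(x), double_pullback
end

Zygote consumes rrules through ChainRules, so translated rules keep working under Zygote while also becoming available to other ChainRules-aware ADs.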

ToucheSir (Member) commented:
This ought to make debugging easier as well. A potential next step would be moving more bits out to NNlib(CUDA).

src/cuda/cudnn.jl (resolved)
src/layers/normalise.jl (resolved)
Comment on lines 438 to 441
# TODO move to ChainRulesCore?
@adjoint function Broadcast.broadcasted(f::Recur, args...)
  Zygote.∇map(__context__, f, args...)
end
mcabbott (Member Author) commented:
I think the point of this is that the gradient for map reverses iteration order. That's a little dodgy, since map makes no such promise (and IIRC it only happens for some argument types, vectors but not 1-column matrices?). Should we just make broadcasting an RNN within a gradient an error?

mcabbott (Member Author) commented:
Moved to here, which I think should give the same results, but also warn on the forward pass:

https://github.com/FluxML/Flux.jl/pull/1863/files#diff-5b453f8f7fb34afbebfc6f688a8209aa0532c8b1c3e95393f97afcbc37a473e7R41-R46
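For context, a sketch of the broadcasting pattern this rule covers (illustrative code assuming the Flux v0.12-era RNN constructor; not taken from this PR):

using Flux

m  = Flux.RNN(2, 3)                      # a Recur-wrapped recurrent cell
xs = [rand(Float32, 2) for _ in 1:5]     # a length-5 sequence

# Broadcasting the stateful layer over the sequence hits the rule above;
# its pullback must walk the sequence in reverse order, which Zygote.∇map provides.
ys = m.(xs)

# An explicit comprehension over the sequence sidesteps the question entirely:
Flux.reset!(m)
ys2 = [m(x) for x in xs]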

src/utils.jl (outdated)
Comment on lines 791 to 792
@nograd modules
ChainRulesCore.@non_differentiable modules(::Any) # is this correct?
mcabbott (Member Author) commented:
If the intention of modules is that something roughly like loss + sum(norm, modules(m)) should work, then doesn't this need to pass gradients through?

ToucheSir (Member) commented:
Good catch. I have a sinking feeling this might be one of those things that works with implicit gradients but not with explicit ones.

mcabbott (Member Author) commented:
Likewise. Xref FluxML/Functors.jl#35 I guess -- is fmapreduce(x -> norm(x.weight), +, m; exclude = x -> x isa Dense) where we want to end up?

ToucheSir (Member) commented:
That would be one way of doing things. The big question with any approach is how to prevent AD from balking at the cache mutation + lookup.
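For concreteness, a sketch of the regularisation pattern under discussion (illustrative names, Flux v0.12-era constructors assumed; not code from this PR):

using Flux

m = Chain(Dense(2, 3, relu), Dense(3, 1))

# An L2-style penalty built from Flux.modules. For this to regularise training,
# gradients must flow through modules(m) back to the weights, so marking modules
# as @non_differentiable would silently drop the penalty's contribution.
penalty(m) = sum(sum(abs2, l.weight) for l in Flux.modules(m) if l isa Dense)

x, y = rand(Float32, 2, 8), rand(Float32, 1, 8)
loss(m, x, y) = Flux.Losses.mse(m(x), y) + 0.01f0 * penalty(m)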

@@ -23,6 +23,9 @@ end
res, Δ -> (nothing, Zygote.unbroadcast(x, xlogy.(Δ, y)), Zygote.unbroadcast(y, Δ .* x ./ y))
end

ChainRulesCore.@scalar_rule xlogy(x, y) (log(y), x/y) # is this good enough?
ChainRulesCore.@scalar_rule xlogx(x) (log(x) + true)
mcabbott (Member Author) commented:
Can't literally translate the broadcasted(::typeof(xlogy)) rule to a Zygote-free world, as unbroadcast (which sums as necessary for mismatched shapes) belongs to Zygote.

I hope that Diffractor's broadcasting will work via @scalar_rule. But the rule as written is slightly different, as it doesn't treat Δ==0 as a strong zero, when y==0. Does that matter?
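One way to sanity-check these scalar rules in isolation (a sketch with stand-in definitions that mirror Flux's xlogx / xlogy, using ChainRulesTestUtils):

using ChainRulesCore, ChainRulesTestUtils

# Stand-in definitions so the rules can be checked outside Flux.
xlogx(x) = iszero(x) ? zero(x) : x * log(x)
xlogy(x, y) = iszero(x) ? zero(x) : x * log(y)

ChainRulesCore.@scalar_rule xlogx(x) (log(x) + true)
ChainRulesCore.@scalar_rule xlogy(x, y) (log(y), x / y)

# Finite-difference checks, away from the x == 0 edge case where the
# strong-zero question arises:
test_rrule(xlogx, 0.3)
test_rrule(xlogy, 0.3, 0.7)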

mcabbott (Member Author) commented:
Flux could switch to the LogExpFunctions.jl definitions. They use branches rather than ifelse, and have different NaN behaviour; not sure if that matters:

https://github.com/JuliaStats/LogExpFunctions.jl/blob/584442d9bd4c4abadfb5daed86cefa5fabfff645/src/basicfuns.jl#L17-L30

And 5 dependencies.

mcabbott (Member Author) commented:
But for now perhaps it's evidence that the scalar rules are ok?

ToucheSir (Member) commented:
Are you looking to do some testing soon with this and Diffractor/not Zygote? Otherwise I think it would be cleaner to have a separate PR that removes all of the code above in favour of https://github.com/FluxML/Zygote.jl/blob/master/src/lib/logexpfunctions.jl and the @scalar_rules in LogExpFunctions.

mcabbott (Member Author) commented:
I can remove these rules for now if you prefer. The functions ought to be differentiable without special rules, mostly. For now, the PR just aims to translate over as many things as possible.

mcabbott (Member Author) commented:
I said:

as unbroadcast (which sums as necessary for mismatched shapes)

This is wrong, because _check_sizes demands equal size, simplifying the broadcast:

https://github.com/FluxML/Flux.jl/blob/master/src/losses/utils.jl#L27

While I guess these broadcasts aren't so performance-sensitive (since there will only be one, for the whole model), it would be nice if all loss functions were second-differentiable. Whether that already works, or needs to be done by fiddling with broadcasting, or by rules for the loss functions themselves, I don't know.
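One way to probe the second-differentiability question (a sketch; whether the jacobian-of-gradient call below actually succeeds is exactly the open question):

using Flux, Zygote

ŷ = softmax(rand(Float32, 3, 4))           # predicted probabilities
y = Flux.onehotbatch(rand(1:3, 4), 1:3)    # one-hot targets

loss(ŷ) = Flux.Losses.crossentropy(ŷ, y)   # uses xlogy internally

g(ŷ) = Zygote.gradient(loss, ŷ)[1]         # first derivative
H = Zygote.jacobian(g, ŷ)[1]               # second derivative, the jacobian of the gradient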

@mcabbott mcabbott force-pushed the chainrules branch 2 times, most recently from 923eca0 to 0599968, on February 5, 2022 at 19:40
@mcabbott mcabbott marked this pull request as ready for review February 14, 2022 05:05
@@ -1,6 +1,6 @@
istraining() = false

@adjoint istraining() = true, _ -> nothing
ChainRulesCore.rrule(::typeof(istraining)) = true, _ -> (NoTangent(),)
ToucheSir (Member) commented:
I'm surprised there isn't an equivalent for this in ChainRules already.

mcabbott (Member Author) commented:
Somewhere I was writing a function like CRC.order().back > 0... would be good to have.
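For reference, a self-contained sketch of the trick the istraining rule above implements (istraining redefined here only so the snippet stands alone):

using ChainRulesCore, Zygote

istraining() = false
ChainRulesCore.rrule(::typeof(istraining)) = true, _ -> (NoTangent(),)

istraining()    # false when called normally

# Inside a Zygote pullback the rrule's primal value is used instead,
# so layers can branch on whether they are being differentiated:
y, _ = Zygote.pullback(x -> istraining() ? x^2 : x, 3.0)
y               # 9.0, because istraining() returned true under AD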

ToucheSir (Member) commented:
bors try

bors bot (Contributor) commented Feb 24, 2022:

try

Merge conflict.

@ToucheSir ToucheSir closed this Feb 24, 2022
@ToucheSir ToucheSir reopened this Feb 24, 2022
ToucheSir (Member) commented:
If you wouldn't mind rebasing, we can get this merged, assuming that fixes the CUDA tests.

src/cuda/cuda.jl (outdated, resolved)
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
@mcabbott mcabbott merged commit 525b645 into FluxML:master Feb 24, 2022
@mcabbott mcabbott deleted the chainrules branch February 24, 2022 15:27