[Relay] Add gradient operator tutorial docs #2751
Conversation
docs/dev/relay_add_op.rst
Outdated
Gradient Operators
------------------

Gradient operators are important for writing differentiable programs in
Specifically you should say that gradient operators are necessary for the AD algorithm; the AD algorithm can differentiate first-class language constructs, but because operators are opaque, it needs an explicit differentiation rule for them.
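For concreteness, an explicit rule of this kind looks roughly like the following sketch in Python (it assumes the `register_gradient` decorator this doc describes; `multiply` is used purely as an illustration, and the broadcasting detail handled by `collapse_sum_like` comes up later in this thread):

```python
from tvm.relay.op import register_gradient

# Illustrative sketch: an explicit rule telling the AD pass how to
# differentiate the opaque "multiply" operator. A rule like this already
# ships with TVM, so re-registering it on a stock build may conflict.
@register_gradient("multiply")
def multiply_grad(orig, grad):
    # orig is the original call multiply(x, y); grad is the adjoint of its
    # output. For f(x, y) = x * y, the adjoints are grad * y and grad * x
    # (broadcast handling omitted here; see collapse_sum_like below).
    x, y = orig.args
    return [grad * y, grad * x]
```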
Nits:
- Do we have any documentation of Relay's AD? It would be good to link to that.
- It would be good to say that the differentiation rule is opaque to Relay as well.
- I think the sentence in which you describe operators as "opaque" should also contain the clause clarifying that "opaque" means that Relay cannot look into the implementation.
docs/dev/relay_add_op.rst
Outdated
Before we further analyze this definition, first we should recall the partial
derivatives for multiplication. Given a function f(x, y) = x * y, we have
that df/dx = y and df/dy = x. The definition above looks similar to the math
Do you know any way to use the proper partial derivative notation in RST? ∂f/∂x? (I don't know if just using the unicode will work.)
Wasn't sure if we wanted unicode in there. I guess I could just add it and see if the docs break.
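For reference, the rule under discussion written out in LaTeX (Sphinx's `math` directive and `:math:` role accept essentially this syntax, which may be an alternative to raw Unicode):

```latex
f(x, y) = x \cdot y, \qquad
\frac{\partial f}{\partial x} = y, \qquad
\frac{\partial f}{\partial y} = x
```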
docs/dev/relay_add_op.rst
Outdated
We're not just interested in how to compute the gradient of this function.
We're interested in composing this gradient with other gradients, so we can
accumulate the gradient across an entire program. This is where the ``grad *
We may want a different choice of variable names, because `grad * x` could be misinterpreted as ∇ * x (div). Also, I think you should use :code: in front of the backticks.
Are you sure? I've looked at the source for some of the docs, and all they use are double backticks for code formatting.
Also, I'm not sure what you mean when you say "∇ * x (div)".
https://en.wikipedia.org/wiki/Divergence
I don't know if other docs use something different (some kind of styling), but I have stuck to using :code: for the Relay docs and found when I built them that backticks without it ended up italicized rather than in typewriter font.
docs/dev/relay_add_op.rst
Outdated
differentiating with respect to. We only need to do this for operators with
broadcasting behaviors.

TODO: Why do we only have ``collapse_sum_like`` on some of the gradient
Collapse is only needed when you broadcast, which most binary operations implicitly do.
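A rough way to see this with concrete shapes (a NumPy sketch for illustration only; `collapse_sum_like` is the Relay op that performs the analogous reduction):

```python
import numpy as np

# Adding a (3, 4) tensor and a (1, 4) tensor broadcasts the second operand,
# so the forward result, and therefore the incoming gradient, has shape (3, 4).
x = np.ones((3, 4))
y = np.ones((1, 4))
out = x + y                                   # shape (3, 4)
grad = np.ones_like(out)                      # adjoint of out, shape (3, 4)

# The gradient w.r.t. y must match y's shape, so the broadcast axis gets
# summed out; this is roughly what collapse_sum_like(grad, y) does in Relay.
grad_wrt_y = grad.sum(axis=0, keepdims=True)  # shape (1, 4), matches y
grad_wrt_x = grad                             # already matches x's shape
```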
Adding more reviewers!
Could you try building the docs to make sure the inline ∂ symbols work? Also to make sure that the statements in backticks show up correctly in typewriter font. I haven't tried it myself, but I'd be concerned about those.
docs/dev/relay_add_op.rst
Outdated
provided.

Adding a gradient operator is slightly different from adding a normal
operator, in that you only need to touch Python code. A collection of
It can be done in C++ too (but generally has not been done this way). Is there any example of a gradient registered in C++? As far as I know, it can be done using the operator registry APIs.
With a quick grep, I couldn't find any instances of the "FPrimalGradient" attr being set in C++, so I suspect there aren't any C++ gradient examples. I could still mention the procedure for adding one in C++ though.
Yeah, that's what I meant, just mention it. I don't think there are presently examples, but mention that it can be done and show how in case that proves to be convenient for others.
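A sketch of how that mention might read, assuming (as the grep above suggests) that the Python `register_gradient` helper simply stores the function under the operator's `FPrimalGradient` attribute, which is the same attribute a C++ registration would set through the operator registry:

```python
from tvm.relay.op import register_gradient

def log_grad(orig, grad):
    # d(log x)/dx = 1/x, scaled by the incoming adjoint grad.
    x = orig.args[0]
    return [grad / x]

# Functional form, equivalent to decorating log_grad with
# @register_gradient("log"). Illustrative only: if a rule for "log" is
# already registered in-tree, this call would conflict with it.
register_gradient("log", log_grad)
```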
docs/dev/relay_add_op.rst
Outdated
of the inputs, we use ``collapse_sum_like`` to take the contents of the
``grad * <var>`` terms and make the shape match the input we're
differentiating with respect to. We only need to do this for operators with
broadcasting behaviors.
I think you should have another example that does not have the complexity of `collapse_sum_like`, because this seems like a fairly confusing point (at least, I'm left confused). It might also benefit from some lower-level explanation (or from having some of the math written out more explicitly).
Good point. As far as I understand, it's to handle cases where you have a tensor `x` with shape `(4, 4)` and a tensor `y` with shape `(1, 4)`, and you add them. When you differentiate w.r.t. `x`, you want the gradient's shape to match `x`, and you want the same for `y`. I could be totally wrong though.
If one of the other reviewers can confirm that that's what's going on, then I can add that explanation to the docs. Otherwise, adding another example is a good idea too.
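If that reading is right, the rule for a broadcasting binary op would look roughly like this sketch (`add` is used purely as an example, and the exact import location of `collapse_sum_like` is an assumption here):

```python
from tvm import relay
from tvm.relay.op import register_gradient

# Illustrative sketch for a broadcasting binary op: grad arrives with the
# broadcast output shape, and collapse_sum_like sums it back down so each
# returned adjoint matches the shape of the corresponding input.
@register_gradient("add")  # example only; a rule for "add" may already exist
def add_grad(orig, grad):
    x, y = orig.args
    return [relay.collapse_sum_like(grad, x),
            relay.collapse_sum_like(grad, y)]
```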
It seems that building the docs (assuming I'm doing it correctly) requires a CUDA-enabled GPU for some reason, which my machine doesn't have. Not sure what to do about that.
Are you sure the build completely fails without CUDA enabled? I think some of the docs actually run TVM code and use it to generate example output, but those that don't do that should build. (Also, I'm pretty sure there's a LaTeX-like math mode in RST that some of the docs use, so perhaps look around for those to see a better way to encode mathematical expressions.)
They seem to be failing, because the build directory for the docs is empty when it finishes.
Have you tried a more targeted build that avoids the parts that require CUDA?
Didn't know about the targeted build option. It seems that the Unicode for partial derivatives shows up correctly, and you don't need :code: for the statements in backticks to render in typewriter font.
I'll add another gradient example when I get a chance. Please feel free to review the current state of the PR though.
static const Op& op = Op::Get("collapse_sum_like");
return CallNode::make(op, {e});
}
Should this definition be in a separate PR?
Ping @ZihengJiang @antinucleon
* Add gradient operator tutorial docs
* Incorporate Steven's and Ziheng's feedback
* Remove TODO about `collapse_sum_like`
* Add more examples
Another rough draft. Roast me.
I'm interested in the answer to the TODO at the end of the patch.
@jroesch @MarisaKirisame @slyubomirsky @ZihengJiang