[Relay] Add gradient operator tutorial docs #2751

@@ -139,6 +139,55 @@ before producing the call node:
        tup = Tuple(list(args))
        return _make.concat(tup)

Gradient Operators
------------------

Gradient operators are important for writing differentiable programs in
Relay. While Relay's autodiff algorithm can differentiate first-class
language constructs, operators are opaque to it. Because Relay can't look
into an operator's implementation, an explicit differentiation rule must be
provided.

Adding a gradient operator is slightly different from adding a normal
operator, in that you only need to touch Python code.

    [Review thread]
    - It can be done in C++ too (but generally has not been done this way).
      Is there any example of a gradient registered in C++? As far as I know,
      it can be done using the operator registry APIs
    - With a quick grep, I couldn't find any instances of the
      "FPrimalGradient" attr being set in C++, so I suspect there aren't any
      C++ gradient examples. I could still mention the procedure for adding
      one in C++ though.
    - Yeah, that's what I meant, just mention it. I don't think there are
      presently examples, but mention that it can be done and show how in
      case that proves to be convenient for others.

A collection of gradient operators can be found in
``python/tvm/relay/op/_tensor_grad.py``. A good example is the gradient for
``multiply``:

.. code:: python

    @register_gradient("multiply")
    def multiply_grad(orig, grad):
        """Returns [grad * y, grad * x]"""
        x, y = orig.args
        return [collapse_sum_like(grad * y, x),
                collapse_sum_like(grad * x, y)]

The inputs here are the original operator and a gradient to accumulate into.
What we return is a list, where the 0th index is the derivative of the
multiply operator with respect to the first input, and the 1st index is the
derivative with respect to the second input. In general, the gradient will
return a list with as many elements as there are inputs to the base operator.
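
For a one-input operator, that list has a single element. For comparison, a
gradient for a unary operator such as ``log`` can be written along the lines
of the sketch below (the exact definition in ``_tensor_grad.py`` may differ
slightly; ``ones_like`` is assumed to be available in the same module, just as
``collapse_sum_like`` is above):

.. code:: python

    @register_gradient("log")
    def log_grad(orig, grad):
        """Returns [grad * (1 / x)]"""
        # d(log x)/dx = 1/x; compose with the incoming gradient by
        # multiplying. A single element is returned because log has one input.
        x = orig.args[0]
        return [grad * ones_like(x) / x]

Note that there is no ``collapse_sum_like`` here; as discussed below, it is
only needed when an operator broadcasts its inputs.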

Before we further analyze the ``multiply`` gradient, we should first recall
the partial derivatives for multiplication. Given a function f(x, y) = x * y,
we have that ∂f/∂x = y and ∂f/∂y = x. The ``multiply_grad`` definition looks
similar to these mathematical definitions, but there are some subtle
differences, which we describe below.

We're not just interested in how to compute the gradient of this function.
We're interested in composing this gradient with other gradients, so that we
can accumulate gradients across an entire program. This is where the
``grad * y`` and ``grad * x`` terms come from: we know that ∂f/∂x = y, but we
compose this derivative with the gradient accumulated so far by multiplying
the two together.

    [Review thread]
    - We may want a different choice of variable names because
    - Are you sure? I've looked at the source for some of the docs, and all
      they use are double backticks for code formatting.
    - Also, I'm not sure what you mean when you say "∇ * x (div)".
    - https://en.wikipedia.org/wiki/Divergence I don't know if other docs use
      something different (some kind of styling) but I have stuck to using
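
To spell the composition out: treating all quantities elementwise, suppose
the output of ``multiply`` feeds into some hypothetical scalar loss L, so
that ``grad`` holds ∂L/∂f. The chain rule then gives

.. math::

   \frac{\partial L}{\partial x}
     = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial x}
     = \text{grad} \cdot y,
   \qquad
   \frac{\partial L}{\partial y}
     = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial y}
     = \text{grad} \cdot x,

which is exactly what ``multiply_grad`` returns, up to the shape fix-up
described next.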

Additionally, since the shape of ``grad`` might not match the shape of either
of the inputs, we use ``collapse_sum_like`` to take the contents of the
``grad * <var>`` terms and make the shape match the input we're
differentiating with respect to. We only need to do this for operators with
broadcasting behavior.

    [Review thread]
    - I think you should have another example that does not have the
      complexity of
    - Good point. As far as I understand, it's to handle cases where you have
      a tensor
      If one of the other reviewers can confirm that that's what's going on,
      then I can add that explanation to the docs. Otherwise, adding another
      example is a good idea too.
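
To illustrate the shape issue being discussed here, below is a small NumPy
sketch (not Relay code; the shapes are made up for the example) of why a sum
over the broadcast axes, which is what ``collapse_sum_like`` performs, is
needed:

.. code:: python

    import numpy as np

    x = np.random.rand(4, 3)        # first input
    y = np.random.rand(3)           # second input, broadcast across x's rows
    grad = np.random.rand(4, 3)     # incoming gradient, shaped like x * y

    # grad * x has shape (4, 3), but the gradient with respect to y must have
    # y's shape, (3,). Each element of y contributed to an entire column of
    # the output, so we sum over the broadcast axis -- the NumPy analogue of
    # collapse_sum_like(grad * x, y).
    dy = (grad * x).sum(axis=0)
    assert dy.shape == y.shape

``collapse_sum_like`` generalizes this: it sums its first argument over
whatever axes broadcasting introduced or expanded, so that the result has the
same shape as its second argument.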

TODO: Why do we only have ``collapse_sum_like`` on some of the gradient
operators in ``relay/op/_tensor_grad.py``?

    [Review thread]
    - collapse is only needed when you broadcast, which most binary operations
      implicitly do

Summary
-------

[Review comments]
- Specifically you should say that gradient operators are necessary for the AD
  algorithm; the AD algorithm can differentiate first-class language
  constructs, but because operators are opaque, it needs an explicit
  differentiation rule for them.
- Nits: