[Relay] Add gradient operator tutorial docs #2751

Merged · merged 5 commits · Apr 12, 2019
Changes from 2 commits
49 changes: 49 additions & 0 deletions docs/dev/relay_add_op.rst
@@ -139,6 +139,55 @@ before producing the call node:
    tup = Tuple(list(args))
    return _make.concat(tup)

Gradient Operators
------------------

Gradient operators are important for writing differentiable programs in
Relay. While it is the case that Relay's autodiff algorithm can differentiate
first-class language constructs, operators are opaque. Because Relay can't
look into the implementation, an explicit differentiation rule must be
provided.

Contributor: Specifically, you should say that gradient operators are
necessary for the AD algorithm; the AD algorithm can differentiate
first-class language constructs, but because operators are opaque, it needs
an explicit differentiation rule for them.

Contributor: Nits:

  • Do we have any documentation of Relay's AD? It would be good to link to that.
  • It would be good to say that the differentiation rule is opaque to Relay as well.
  • I think the sentence in which you describe operators as "opaque" should also contain the clause clarifying that "opaque" means that Relay cannot look into the implementation.

Adding a gradient operator is slightly different from adding a normal
operator, in that you only need to touch Python code.

Contributor: It can be done in C++ too (but generally has not been done this
way). Is there any example of a gradient registered in C++? As far as I know,
it can be done using the operator registry APIs.

Contributor Author: With a quick grep, I couldn't find any instances of the
"FPrimalGradient" attr being set in C++, so I suspect there aren't any C++
gradient examples. I could still mention the procedure for adding one in C++,
though.

Contributor: Yeah, that's what I meant, just mention it. I don't think there
are presently examples, but mention that it can be done and show how, in case
that proves to be convenient for others.
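
For context on the "FPrimalGradient" attr mentioned above: on the Python
side, ``register_gradient`` is essentially a thin wrapper that stores the
gradient function under that operator attribute, roughly as sketched below (a
paraphrase of ``python/tvm/relay/op/op.py`` as of this PR's era; verify
against the source, since the exact signature may differ between TVM
versions). A C++ registration would set the same "FPrimalGradient" attribute
through the operator registry.

.. code:: python

    # Paraphrased sketch, not part of this diff. ``register`` is the generic
    # op-attribute registration helper defined in the same module.
    from tvm.relay.op.op import register

    def register_gradient(op_name, fgradient=None, level=10):
        """Attach ``fgradient`` to ``op_name`` under "FPrimalGradient"."""
        return register(op_name, "FPrimalGradient", fgradient, level)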

A collection of gradient operators can be found in
``python/tvm/relay/op/_tensor_grad.py``. A good example is the gradient for
``multiply``:

.. code:: python

    @register_gradient("multiply")
    def multiply_grad(orig, grad):
        """Returns [grad * y, grad * x]"""
        x, y = orig.args
        return [collapse_sum_like(grad * y, x),
                collapse_sum_like(grad * x, y)]

The inputs here are the original operator call and the gradient accumulated
so far with respect to the operator's output. What we return is a list: the
element at index 0 is the derivative of the multiply operator with respect to
its first input, and the element at index 1 is the derivative with respect to
its second input. In general, the gradient will return a list with as many
elements as there are inputs to the base operator.

Before we analyze this definition further, we should recall the partial
derivatives for multiplication. Given a function f(x, y) = x * y, we have
that ∂f/∂x = y and ∂f/∂y = x. The definition above looks similar to these
mathematical definitions, but there are some subtle differences, which we
describe below.

We're not just interested in how to compute the gradient of this function.
We're interested in composing this gradient with other gradients, so we can
accumulate the gradient across an entire program. This is where the
``grad * y`` and ``grad * x`` terms come from. We know that ∂f/∂x = y, but we
compose this derivative with the gradient accumulated thus far by multiplying
the two together.

Contributor: We may want a different choice of variable names, because
grad * x could be misinterpreted as ∇ * x (div). Also, I think you should use
:code: in front of the backticks.

Contributor Author: Are you sure? I've looked at the source for some of the
docs, and all they use are double backticks for code formatting.

Contributor Author: Also, I'm not sure what you mean when you say
"∇ * x (div)".

Contributor: https://en.wikipedia.org/wiki/Divergence

I don't know if other docs use something different (some kind of styling),
but I have stuck to using :code: for Relay docs and found when I built the
docs that backticks without it ended up italicized, not in typewriter font.
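
Returning to the ``grad * y`` and ``grad * x`` terms, the composition is just
the chain rule, spelled out here for clarity (a restatement added for
illustration, not part of the diff). If the rest of the program computes some
final value z from the output of f, then AD reaches this operator with
grad = ∂z/∂f, and the two returned entries are

.. math::

   \frac{\partial z}{\partial x}
     = \frac{\partial z}{\partial f} \cdot \frac{\partial f}{\partial x}
     = \mathtt{grad} \cdot y,
   \qquad
   \frac{\partial z}{\partial y}
     = \frac{\partial z}{\partial f} \cdot \frac{\partial f}{\partial y}
     = \mathtt{grad} \cdot x.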

Additionally, since the shape of ``grad`` might not match the shape of either
of the inputs, we use ``collapse_sum_like`` to sum the contents of the
``grad * <var>`` terms down to the shape of the input we're differentiating
with respect to. We only need to do this for operators with broadcasting
behavior.
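
To make the broadcasting case concrete, here is a small NumPy analogy (an
illustration added for clarity, not part of the diff; the real gradient uses
``collapse_sum_like`` rather than an explicit sum). Suppose ``x`` has shape
(4, 4) and ``y`` has shape (1, 4): then ``x * y`` and ``grad`` both have
shape (4, 4), and the gradient with respect to ``y`` has to be summed back
down to shape (1, 4).

.. code:: python

    import numpy as np

    x = np.random.rand(4, 4)    # first input
    y = np.random.rand(1, 4)    # second input, broadcast along axis 0
    grad = np.ones((4, 4))      # incoming gradient, shaped like x * y

    # The gradient w.r.t. x already has x's shape after y is broadcast.
    dx = grad * y
    assert dx.shape == x.shape

    # The gradient w.r.t. y starts out with shape (4, 4); summing over the
    # broadcast axis (roughly what collapse_sum_like does) restores y's shape.
    dy = (grad * x).sum(axis=0, keepdims=True)
    assert dy.shape == y.shape
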
Contributor: I think you should have another example that does not have the
complexity of collapse_sum_like, because this seems like a fairly confusing
point (at least, I'm left confused). It would also benefit from some
lower-level explanation (or having some of the math more explicitly written
out).

Contributor Author: Good point. As far as I understand, it's to handle cases
where you have a tensor x with shape (4, 4) and a tensor y with shape (1, 4),
and you add them. When you differentiate with respect to x, you want the
shape to match x, and you want the same for y. I could be totally wrong,
though.

If one of the other reviewers can confirm that that's what's going on, then I
can add that explanation to the docs. Otherwise, adding another example is a
good idea too.
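
As a point of comparison (an example added here for illustration, not taken
from the diff), a gradient that involves no broadcasting needs no
``collapse_sum_like`` at all. Here is what the existing gradient for ``log``
looks like, paraphrased from ``_tensor_grad.py`` (check the file for the
exact definition and import paths): ``log`` has a single input, and ``grad``
already has that input's shape, so the rule simply returns a one-element
list.

.. code:: python

    # Paraphrased from _tensor_grad.py; import paths may vary by TVM version.
    from tvm.relay.op.op import register_gradient
    from tvm.relay.op.tensor import ones_like

    @register_gradient("log")
    def log_grad(orig, grad):
        """Returns [grad * (1 / x)]; one entry because log has one input."""
        x = orig.args[0]
        return [grad * ones_like(x) / x]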


TODO: Why do we only have ``collapse_sum_like`` on some of the gradient
operators in ``relay/op/_tensor_grad.py``?

Contributor: Collapse is only needed when you broadcast, which most binary
operations implicitly do.

Summary
-------
