
[Relay] Add gradient operator tutorial docs #2751

Merged 5 commits into apache:master on Apr 12, 2019

Conversation

@weberlo (Contributor) commented Mar 8, 2019

Another rough draft. Roast me.

I'm interested in the answer to the TODO at the end of the patch.

@jroesch @MarisaKirisame @slyubomirsky @ZihengJiang

@ZihengJiang (Contributor):

collapse_sum_like is only for operators which have broadcast behaviors

Gradient Operators
------------------

Gradient operators are important for writing differentiable programs in
Contributor:

Specifically you should say that gradient operators are necessary for the AD algorithm; the AD algorithm can differentiate first-class language constructs, but because operators are opaque, it needs an explicit differentiation rule for them.
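
A quick sketch of what such an explicit rule looks like on the Python side (roughly what the tutorial's multiply example boils down to; the real registration lives in python/tvm/relay/op/_tensor_grad.py, and the exact import paths here are assumed from the relay op registry layout):

```python
# Sketch of an explicit differentiation rule for an opaque operator.
from tvm.relay.op import register_gradient
from tvm.relay.op.transform import collapse_sum_like

@register_gradient("multiply")
def multiply_grad(orig, grad):
    """Gradient rule for f(x, y) = x * y: returns [grad * y, grad * x]."""
    x, y = orig.args
    # collapse_sum_like folds each term back to the corresponding input's
    # shape, in case that input was broadcast in the forward computation.
    return [collapse_sum_like(grad * y, x),
            collapse_sum_like(grad * x, y)]
```

The returned list is expected to line up positionally with orig.args, one gradient per input, which is how the AD pass knows which gradient belongs to which argument.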

Contributor:

Nits:

  • Do we have any documentation of Relay's AD? It would be good to link to that
  • It would be good to say that the differentiation rule is opaque to Relay as well
  • I think the sentence in which you describe operators as "opaque" should also contain the clause clarifying that "opaque" means that Relay cannot look into the implementation


Before we analyze this definition further, we should recall the partial
derivatives for multiplication. Given a function f(x, y) = x * y, we have
that df/dx = y and df/dy = x. The definition above looks similar to the math
@slyubomirsky (Contributor) commented Mar 8, 2019:

Do you know any way to use the proper partial derivative notation in RST? ∂f/∂x? (I don't know if just using the unicode will work.)

Contributor Author:

Wasn't sure if we wanted unicode in there. I guess I could just add it and see if the docs break.
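
For reference, written out (here L stands for whatever downstream quantity is ultimately being differentiated, and grad for the incoming gradient ∂L/∂f; this is just the math the passage and the later ``grad * <var>`` terms express):

```latex
f(x, y) = x \cdot y
  \;\Longrightarrow\;
  \frac{\partial f}{\partial x} = y,
  \qquad
  \frac{\partial f}{\partial y} = x

% composing with the incoming gradient grad = \partial L / \partial f:
\frac{\partial L}{\partial x}
  = \mathrm{grad} \cdot \frac{\partial f}{\partial x}
  = \mathrm{grad} \cdot y,
\qquad
\frac{\partial L}{\partial y}
  = \mathrm{grad} \cdot \frac{\partial f}{\partial y}
  = \mathrm{grad} \cdot x
```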


We're not just interested in how to compute the gradient of this function.
We're interested in composing this gradient with other gradients, so we can
accumulate the gradient across an entire program. This is where the ``grad *
Contributor:

We may want a different choice of variable names because grad * x could be misinterpreted as ∇ * x (div). Also I think you should use :code: in front of the backticks

Contributor Author:

Are you sure? I've looked at the source for some of the docs, and all they use are double backticks for code formatting.

Contributor Author:

Also, I'm not sure what you mean when you say "∇ * x (div)".

Contributor:

https://en.wikipedia.org/wiki/Divergence

I don't know if other docs use something different (some kind of styling), but I have stuck to using :code: for the Relay docs, and when I built them I found that backticks without it ended up italicized rather than in typewriter font.

differentiating with respect to. We only need to do this for operators with
broadcasting behaviors.

TODO: Why do we only have ``collapse_sum_like`` on some of the gradient
Contributor:

collapse is only needed when you broadcast, which most binary operations implicitly do
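
In NumPy terms, the point is roughly this (shapes chosen purely for illustration):

```python
import numpy as np

x = np.ones((4, 4))
y = np.ones((1, 4))
z = x + y                                   # y is broadcast up to (4, 4)

grad = np.ones_like(z)                      # incoming gradient, shape (4, 4)
grad_x = grad                               # already matches x's shape
grad_y = grad.sum(axis=0, keepdims=True)    # "collapse" back to y's (1, 4)

assert grad_x.shape == x.shape and grad_y.shape == y.shape
```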

@weberlo (Contributor Author) commented Mar 10, 2019

Adding more reviewers!

@junrushao1994 @antinucleon

@slyubomirsky (Contributor) commented:

Could you try building the docs to make sure the inline ∂ symbols work? Also to make sure that the statements in backticks show up correctly in typewriter font. I haven't tried myself but I'd be concerned about those.

provided.

Adding a gradient operator is slightly different from adding a normal
operator, in that you only need to touch Python code. A collection of
Contributor:

It can be done in C++ too (but generally has not been done this way). Is there any example of a gradient registered in C++? As far as I know, it can be done using the operator registry APIs

Contributor Author:

With a quick grep, I couldn't find any instances of the "FPrimalGradient" attr being set in C++, so I suspect there aren't any C++ gradient examples. I could still mention the procedure for adding one in C++ though.

Contributor:

Yeah, that's what I meant, just mention it. I don't think there are presently examples, but mention that it can be done and show how in case that proves to be convenient for others.
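
Whichever language the rule is written in, it ends up attached to the op as its "FPrimalGradient" attribute, so one way the doc could demonstrate that both routes land in the same place is a quick check from Python (a sketch; the get_attr accessor is assumed from the op registry API):

```python
from tvm import relay

# The gradient rule, however it was registered, is stored on the op
# under the "FPrimalGradient" attribute.
mul_op = relay.op.get("multiply")
print(mul_op.get_attr("FPrimalGradient"))   # non-None once a gradient is registered
```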

of the inputs, we use ``collapse_sum_like`` to take the contents of the
``grad * <var>`` terms and make the shape match the input we're
differentiating with respect to. We only need to do this for operators with
broadcasting behaviors.
Contributor:

I think you should have another example that does not have the complexity of collapse_sum_like, because this seems like a fairly confusing point (at least, I'm left confused). It would maybe also benefit from some lower-level explanation (or from having some of the math written out more explicitly).

Contributor Author:

Good point. As far as I understand, it's to handle cases where you have a tensor x with shape (4, 4) and a tensor y with shape (1, 4), and you add them: y gets broadcast up to (4, 4). When you differentiate with respect to x, you want the gradient's shape to match x, and likewise for y, so the gradient with respect to y has to be summed back down to (1, 4). I could be totally wrong though.

If one of the other reviewers can confirm that that's what's going on, then I can add that explanation to the docs. Otherwise, adding another example is a good idea too.
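
If that explanation goes in, a tiny snippet could back it up. A sketch with assumed shapes (1, 4) and (4, 4), against a recent TVM API (module/transform names may differ across versions):

```python
import tvm
from tvm import relay

# The forward op broadcast y from (1, 4) up to (4, 4), so the incoming
# gradient has shape (4, 4) and must be collapsed back to y's shape.
y = relay.var("y", shape=(1, 4), dtype="float32")
grad = relay.var("grad", shape=(4, 4), dtype="float32")

dy = relay.collapse_sum_like(grad, y)
mod = tvm.IRModule.from_expr(relay.Function([y, grad], dy))
mod = relay.transform.InferType()(mod)
print(mod["main"].ret_type)   # expect Tensor[(1, 4), float32]
```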

@weberlo (Contributor Author) commented Mar 13, 2019

> Could you try building the docs to make sure the inline ∂ symbols work? Also to make sure that the statements in backticks show up correctly in typewriter font. I haven't tried myself but I'd be concerned about those.

It seems that building the docs (assuming I'm doing it correctly) requires a CUDA-enabled GPU for some reason, which my machine doesn't have. Not sure what to do about that.

@slyubomirsky (Contributor) commented Mar 15, 2019

Are you sure the build completely fails without CUDA enabled? I think some of the docs actually run TVM code and use it to generate example output, but those that don't do that should build. (Also I'm pretty sure there's a LaTeX-like math mode in RST that some of the docs use, so perhaps look around for those to see a better way to encode mathematical expressions.)

@weberlo (Contributor Author) commented Mar 18, 2019

> Are you sure the build completely fails without CUDA enabled? I think some of the docs actually run TVM code and use it to generate example output, but those that don't do that should build. (Also I'm pretty sure there's a LaTeX-like math mode in RST that some of the docs use, so perhaps look around for those to see a better way to encode mathematical expressions.)

They seem to be failing, because the build directory for the docs is empty when it finishes.

@slyubomirsky (Contributor) commented:

Have you tried a more targeted build that avoids the parts that require Cuda?

@weberlo (Contributor Author) commented Mar 19, 2019

> Have you tried a more targeted build that avoids the parts that require Cuda?

Didn't know about the TVM_TUTORIAL_EXEC_PATTERN environment variable. Thanks for the offline help, Steven.

It seems that the unicode for partial derivatives shows up correctly, and you don't need :code: for inline snippets. See the image below:

[screenshot of the rendered docs]

@weberlo (Contributor Author) commented Mar 19, 2019

I'll add another gradient example when I get a chance. Please feel free to review the current state of the PR though.
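
One candidate for such an example, in the same registration style as the multiply rule, is an operator with no broadcasting to collapse (a sketch, not necessarily what ended up in the patch):

```python
from tvm.relay.op import register_gradient

@register_gradient("exp")
def exp_grad(orig, grad):
    """d/dx exp(x) = exp(x); orig is the exp(x) call itself, so reuse it."""
    return [grad * orig]
```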

static const Op& op = Op::Get("collapse_sum_like");
return CallNode::make(op, {e});
}

Contributor Author:

Should this definition be in a separate PR?

@weberlo (Contributor Author) commented Mar 22, 2019

Ping @ZihengJiang @antinucleon

@jroesch merged commit b9349cb into apache:master on Apr 12, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request May 13, 2019
* Add gradient operator tutorial docs

* Incorporate Steven's and Ziheng's feedback

* Remove TODO about `collapse_sum_like`

* Add more examples
wweic pushed a commit to neo-ai/tvm that referenced this pull request May 13, 2019