[Relay] Add gradient operator tutorial docs #2751

Merged · merged 5 commits · Apr 12, 2019
Changes from 2 commits
49 changes: 49 additions & 0 deletions docs/dev/relay_add_op.rst
@@ -139,6 +139,55 @@ before producing the call node:
    tup = Tuple(list(args))
    return _make.concat(tup)

Gradient Operators
------------------

Gradient operators are important for writing differentiable programs in
Relay. While it is the case that Relay's autodiff algorithm can differentiate
first-class language constructs, operators are opaque. Because Relay can't
look into the implementation, an explicit differentiation rule must be
provided.

Contributor: Specifically, you should say that gradient operators are
necessary for the AD algorithm; the AD algorithm can differentiate
first-class language constructs, but because operators are opaque, it needs
an explicit differentiation rule for them.

Contributor: Nits:

  • Do we have any documentation of Relay's AD? It would be good to link to that.
  • It would be good to say that the differentiation rule is opaque to Relay as well.
  • I think the sentence in which you describe operators as "opaque" should also contain the clause clarifying that "opaque" means that Relay cannot look into the implementation.

Adding a gradient operator is slightly different from adding a normal
operator, in that you only need to touch Python code.

Contributor: It can be done in C++ too (but generally has not been done this
way). Is there any example of a gradient registered in C++? As far as I know,
it can be done using the operator registry APIs.

Contributor Author: With a quick grep, I couldn't find any instances of the
"FPrimalGradient" attr being set in C++, so I suspect there aren't any C++
gradient examples. I could still mention the procedure for adding one in C++,
though.

Contributor: Yeah, that's what I meant, just mention it. I don't think there
are presently examples, but mention that it can be done and show how, in case
that proves to be convenient for others.
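
For context on the "FPrimalGradient" attr mentioned above: on the Python
side, ``register_gradient`` is essentially a thin wrapper that stores the
gradient function under that operator attribute, roughly as sketched below (a
paraphrase of ``python/tvm/relay/op/op.py`` as of this PR's era; verify
against the source, since the exact signature may differ between TVM
versions). A C++ registration would set the same "FPrimalGradient" attribute
through the operator registry.

.. code:: python

    # Paraphrased sketch, not part of this diff. ``register`` is the generic
    # op-attribute registration helper defined in the same module.
    from tvm.relay.op.op import register

    def register_gradient(op_name, fgradient=None, level=10):
        """Attach ``fgradient`` to ``op_name`` under "FPrimalGradient"."""
        return register(op_name, "FPrimalGradient", fgradient, level)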

A collection of gradient operators can be found in
``python/tvm/relay/op/_tensor_grad.py``. A good example is the gradient for
``multiply``:

.. code:: python

    @register_gradient("multiply")
    def multiply_grad(orig, grad):
        """Returns [grad * y, grad * x]"""
        x, y = orig.args
        return [collapse_sum_like(grad * y, x),
                collapse_sum_like(grad * x, y)]

The inputs here are the original operator call and the gradient accumulated
so far with respect to the operator's output. What we return is a list: the
element at index 0 is the derivative of the multiply operator with respect to
its first input, and the element at index 1 is the derivative with respect to
its second input. In general, the gradient will return a list with as many
elements as there are inputs to the base operator.

Before we analyze this definition further, we should recall the partial
derivatives for multiplication. Given a function f(x, y) = x * y, we have
that ∂f/∂x = y and ∂f/∂y = x. The definition above looks similar to these
mathematical definitions, but there are some subtle differences, which we
describe below.

We're not just interested in how to compute the gradient of this function.
We're interested in composing this gradient with other gradients, so we can
accumulate the gradient across an entire program. This is where the
``grad * y`` and ``grad * x`` terms come from. We know that ∂f/∂x = y, but we
compose this derivative with the gradient accumulated thus far by multiplying
the two together.

Contributor: We may want a different choice of variable names, because
grad * x could be misinterpreted as ∇ * x (div). Also, I think you should use
:code: in front of the backticks.

Contributor Author: Are you sure? I've looked at the source for some of the
docs, and all they use are double backticks for code formatting.

Contributor Author: Also, I'm not sure what you mean when you say
"∇ * x (div)".

Contributor: https://en.wikipedia.org/wiki/Divergence

I don't know if other docs use something different (some kind of styling),
but I have stuck to using :code: for Relay docs and found when I built the
docs that backticks without it ended up italicized, not in typewriter font.
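
Returning to the ``grad * y`` and ``grad * x`` terms, the composition is just
the chain rule, spelled out here for clarity (a restatement added for
illustration, not part of the diff). If the rest of the program computes some
final value z from the output of f, then AD reaches this operator with
grad = ∂z/∂f, and the two returned entries are

.. math::

   \frac{\partial z}{\partial x}
     = \frac{\partial z}{\partial f} \cdot \frac{\partial f}{\partial x}
     = \mathtt{grad} \cdot y,
   \qquad
   \frac{\partial z}{\partial y}
     = \frac{\partial z}{\partial f} \cdot \frac{\partial f}{\partial y}
     = \mathtt{grad} \cdot x.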

Additionally, since the shape of ``grad`` might not match the shape of either
of the inputs, we use ``collapse_sum_like`` to sum the contents of the
``grad * <var>`` terms down to the shape of the input we're differentiating
with respect to. We only need to do this for operators with broadcasting
behavior.
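
To make the broadcasting case concrete, here is a small NumPy analogy (an
illustration added for clarity, not part of the diff; the real gradient uses
``collapse_sum_like`` rather than an explicit sum). Suppose ``x`` has shape
(4, 4) and ``y`` has shape (1, 4): then ``x * y`` and ``grad`` both have
shape (4, 4), and the gradient with respect to ``y`` has to be summed back
down to shape (1, 4).

.. code:: python

    import numpy as np

    x = np.random.rand(4, 4)    # first input
    y = np.random.rand(1, 4)    # second input, broadcast along axis 0
    grad = np.ones((4, 4))      # incoming gradient, shaped like x * y

    # The gradient w.r.t. x already has x's shape after y is broadcast.
    dx = grad * y
    assert dx.shape == x.shape

    # The gradient w.r.t. y starts out with shape (4, 4); summing over the
    # broadcast axis (roughly what collapse_sum_like does) restores y's shape.
    dy = (grad * x).sum(axis=0, keepdims=True)
    assert dy.shape == y.shape
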
Contributor: I think you should have another example that does not have the
complexity of collapse_sum_like, because this seems like a fairly confusing
point (at least, I'm left confused). It would also benefit from some
lower-level explanation (or having some of the math more explicitly written
out).

Contributor Author: Good point. As far as I understand, it's to handle cases
where you have a tensor x with shape (4, 4) and a tensor y with shape (1, 4),
and you add them. When you differentiate with respect to x, you want the
shape to match x, and you want the same for y. I could be totally wrong,
though.

If one of the other reviewers can confirm that that's what's going on, then I
can add that explanation to the docs. Otherwise, adding another example is a
good idea too.
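
As a point of comparison (an example added here for illustration, not taken
from the diff), a gradient that involves no broadcasting needs no
``collapse_sum_like`` at all. Here is what the existing gradient for ``log``
looks like, paraphrased from ``_tensor_grad.py`` (check the file for the
exact definition and import paths): ``log`` has a single input, and ``grad``
already has that input's shape, so the rule simply returns a one-element
list.

.. code:: python

    # Paraphrased from _tensor_grad.py; import paths may vary by TVM version.
    from tvm.relay.op.op import register_gradient
    from tvm.relay.op.tensor import ones_like

    @register_gradient("log")
    def log_grad(orig, grad):
        """Returns [grad * (1 / x)]; one entry because log has one input."""
        x = orig.args[0]
        return [grad * ones_like(x) / x]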


TODO: Why do we only have ``collapse_sum_like`` on some of the gradient
operators in ``relay/op/_tensor_grad.py``?

Contributor: Collapse is only needed when you broadcast, which most binary
operations implicitly do.

Summary
-------
