Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc #9038

Merged
merged 2 commits into from
Sep 29, 2021

Conversation

mbs-octoml
Copy link
Contributor

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:

  • It needs to be done before lowering so we know which primitives need
    to be compiled for which devices.
  • It then needs to be re-done after lowering and optimization as a prelude
    to memory planning.
    By baking the device plan into the AST we can simply do device planning before
    lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':

  • transforms/device_annotation.cc, which supports only a small subset of Relay
    and uses a simple top-down algorithm to assign a device to every
    sub-expression.
  • analysis/context_analysis.cc, which makes a galant effort to support most of
    Relay, is based on unification rather than a top-down algorithm, but handles
    higher order functions by ad-hoc and fragile inlining.
  • transforms/annotate_target.cc, which works on Targets instead of Devices, but
    is otherwise like 'device planning'.
    We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:

  1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
  2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
  3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
  4. Be a little more systematic with defaulting.
  5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
  6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
    • backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
    • backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
    • backend/te_compiler.cc (LowerTensorExprMutator)
    • backend/vm/lambda_lift.cc (LambdaLifter)
    • transforms/memory_alloc.cc (DialectRewriter)
    • transforms/to_a_normal_form.cc (Fill)
    • backend/vm/compiler.cc (VMFunctionCompiler)
      However we won't change any of those in this PR.

See the draft #8788 for the end game, I'll be peeling PRs out of that.

@mbs-octoml
Copy link
Contributor Author

@electriclilies @mikepapadim @jroesch Here's part 1. Things to watch out for:

  • How to I may device_copy.cc branch from context_analysis.cc?
  • Did I go too far with device defaulting?
  • All the visitor helpers are needed in follow up but not used here, so sorry it's not obvious why I set things up like that. But suggestion for reducing code dup would be welcome.
  • I tried to make the intro comment comprehensive, let me know what needs more detail.
  • Anything non-obvious.
    Thanks!

@mbs-octoml
Copy link
Contributor Author

Oh, and this is on top of #9012 which is still in-flight, sorry about that.

@@ -0,0 +1,1405 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this is a test file!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to rain on your parade but it might be worth directly using text format here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's pretty awful and I nearly switched a few times already. Which format should I use? Can it capture the attributes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for relay text format, i think that would greatly help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/

/*!
* \file src/relay/analysis/device_planner.cc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read everything except main file, will tackle this part tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. With apologies, but I ended up hoisting out everything except the main into #9077, so I think you could directly LGTM that now?

I'm trying to model good PR hygiene :-)

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbs-octoml did an initial pass, don't think i understand everything yet, but i can see the pieces here. thanks for doing this!!

int device_type;
/*!
* \brief If true, the result device must also be \p device_type and device planning should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i quite understand--isn't this what the struct-level brief above says it does? what happens if is_fixed == false? if we are going to insert a device copy node, that implies the data must be needed elsewhere. how can we merely attach an attribute to a result saying a copy is simply out of the question?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the association of expression to device is a 'constraint', and by default "on_device" on constrains the inner expr and not the overall expr. However to make plan_devices idempotent with the current approach I need some way to fully specify devices.



# for testing only
def function_on_device(function, param_devices, result_device):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call this testonly_function_on_device then or place in python/tvm/relay/op/annotation/test_utils.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up fully commenting this one in a prior PR.

return Call(OnDeviceOp(), {std::move(expr)}, Attrs(std::move(attrs)), /*type_args=*/{}, span);
}

Expr OptOnDevice(Expr expr, DLDeviceType device_type, bool is_fixed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opt=Optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, old personal naming convention but it clashes with Optimize. Switching to Maybe (in the supporting PR).

.set_attr<FTVMCompute>("FTVMCompute",
[](const Attrs& attrs, const Array<te::Tensor>& inputs,
const Type& out_type) -> Array<te::Tensor> {
return {topi::identity(inputs[0])};
});

OnDeviceProps GetOnDeviceProps(const CallNode* call_node) {
if (call_node->op == OnDeviceOp()) {
ICHECK_EQ(call_node->args.size(), 1) << "on_device expects one argument";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to print the args or op node in this case to help debug? same question below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gotten all the way through to unit testing the final PR without needing it, so how about I leave it. I tend to add them as I need them.

*/
OnDeviceProps GetOnDeviceProps(const Expr& expr);

/*! \brief Returns true if \p expr is an on_device CallNode. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will CHECK-fail if there is an ill-formed on_device CallNode tho, not sure if that should get documented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out this is dead code now so nuked.

/*!
* \brief Rewrites "on_device" calls to handle some special cases.
*/
class RewriteOnDevices : public ExprMutator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dang this file is getting a bit long..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah might be worth splitting out into like device_domain.{h, cc} and then putting the core pass parts in their own file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will do.

domains_->Unify(func_domain, implied_domain); // higher-order
} catch (const Error& e) {
// TODO(mbs): Proper diagnostics.
LOG(FATAL) << "Function parameters and result devices do not match those of call. Call:"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!! fatal! probably we need to address the TODO before merging...i think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to handle these when we replace the existing passes with this one, overall would be fine to build action item list from this, land a version and then let mark work on follow up to remove passes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a big improvement on the context_analysis.cc (which will CHECK fail with no context). Let's do this as CORE-75.

return FunctionOnDevice(func, param_device_types, result_device_type);
}

Expr VisitExpr_(const CallNode* call_node) final {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i roughly understand what's happening, but i think example Relay snippets would help to better understand/maintain each Pass here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some simple examples to each of the internal pass class comments.

@@ -0,0 +1,1405 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for relay text format, i think that would greatly help



if __name__ == "__main__":
test_plain()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sys.exit(pytest.main([__file__] + sys.argv[1:]))

one day we will write a helper :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping pong on this one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 26, 2021
These changes came from changing apache#9038 to use
tvm.parser.fromtext instead of manual AST construction.

- Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so
  that the parser will understand them on Function definitions.
- Connect some special operators to their attributes so parsing understands them
  at call sites.
- Don't silently ignore attributes during parsing.
- Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType.
- Allow the parser to be given an initial metadata map to support examples which
  need constants.
- More DLOG -> VLOG conversions to reduce debug clutter.
@mbs-octoml
Copy link
Contributor Author

While converting the unit tests to script form I ended up needing #9130.

I'll work through the remaining comments now.

Copy link
Member

@jroesch jroesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think everything makes sense, big asks would to be splitting up the passes into smaller files with a better separation of the pieces and adding diagnostics and other features. I am totally OK landing this version and building action item list to do before we turn this on as default pass.

@@ -1088,7 +1088,7 @@ class Parser {
VLOG(0) << "Parser::ParseFunctionDef";
return WithSpan<Function>([&]() {
PushScope();
PushTypeScope();
PushTypeScope(); // TODO(mbs): BUG?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second PushTypeScope() is in error and I'll delete it. You unconditionally pop one scope further down.

attrs->result_device_type = result_device_type;
return WithAttr(std::move(function), attr::kFunctionAttrsKey, Attrs(std::move(attrs)));
Integer result_device_type) {
return WithAttr(WithAttr(std::move(function), tvm::attr::kParamDeviceTypes, param_device_types),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe add a WithAttr overload that takes a Map instead of calling multiple times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



if __name__ == "__main__":
test_plain()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping pong on this one

* -------
* After flowing constraints we apply some defaulting heuristics (using a global default device)
* to fix the device for any as-yet unconstrained sub-expressions.
* - Unconstrained function result devices default to the global default device.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "today behavior" technically, we should probably discuss what people expect.

* - Unconstrained function result devices default to the global default device.
* - Unconstrained function parameters devices default to the device for the function result.
* - Unconstrained let-bound expression devices default to the device for the overall let.
* TODO(mbs): I may have over-innovated here and we simply want to bind all free domaints to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the code is setup to "innovate" here from my PoV but today's behavior is super conservative everything defaults to default device which is usually host.

* incorporate targets and memory scopes into the domain. For example it's ok for the function
* body to be executed on different device ids provided they have the same target and memory
* scope.
* * Might be simpler to just let every type have a device annotation rather than work in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed this one before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (lhs->device_type_ != rhs->device_type_) {
// TODO(mbs): Proper diagnostics.
std::ostringstream os;
os << "Inconsistent device types " << lhs->device_type_ << " and " << rhs->device_type_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we can do this, just need to do some plumbing, Mark and I can add tracking for this as part of turning this thing online.

* callee is a primitive or special operation we handle it specially. Otherwise defers to \p
* DomainFor(call->op).
*
* This special handling is needed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a hook here, but I would like to first make sure everyone is using the same algorithm before we provide anymore customization.

/*!
* \brief Rewrites "on_device" calls to handle some special cases.
*/
class RewriteOnDevices : public ExprMutator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah might be worth splitting out into like device_domain.{h, cc} and then putting the core pass parts in their own file.

domains_->Unify(func_domain, implied_domain); // higher-order
} catch (const Error& e) {
// TODO(mbs): Proper diagnostics.
LOG(FATAL) << "Function parameters and result devices do not match those of call. Call:"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to handle these when we replace the existing passes with this one, overall would be fine to build action item list from this, land a version and then let mark work on follow up to remove passes.

mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 27, 2021
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 27, 2021
jroesch pushed a commit that referenced this pull request Sep 28, 2021
* Prepare for new plan_devices.cc (part II)

These changes came from changing #9038 to use
tvm.parser.fromtext instead of manual AST construction.

- Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so
  that the parser will understand them on Function definitions.
- Connect some special operators to their attributes so parsing understands them
  at call sites.
- Don't silently ignore attributes during parsing.
- Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType.
- Allow the parser to be given an initial metadata map to support examples which
  need constants.
- More DLOG -> VLOG conversions to reduce debug clutter.

* [checkpoint] Keep existing ParseModule ffi to simplify rust bindings

* [checkpoint] Address Christopher's comments.

* [checkpoint] Andrew's comments from #9038

* [checkpoint] Jared's comments from #9038

* [checkpoint] Woops, forgot rename.
…tation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game.

* [checkpoint] Use Relay script for all unit tests.

* [checkpoint] Hoist out DeviceDomain and DeviceDomains.

* [checkpoint] Hoist out visitors
@mbs-octoml
Copy link
Contributor Author

mbs-octoml commented Sep 28, 2021

@areusch, @jroesch thank you and PTAL:

  • Rebased on main
  • All unit tests (plus a new one for match :-) in script form
  • Split out device domains and device-aware visitors into separate files
  • Cpp unit test for device domains to prompt us to better support cpp unit tests throughout
  • Addressed above comments.

@jroesch
Copy link
Member

jroesch commented Sep 29, 2021

I am going to land this, if people have follow-ups we can do them in the final PR which actually turns this on.

@jroesch jroesch merged commit d7a28f9 into apache:main Sep 29, 2021
@mbs-octoml mbs-octoml deleted the mbs-plan-devices branch September 29, 2021 17:22
AndrewZhaoLuo added a commit to AndrewZhaoLuo/tvm that referenced this pull request Sep 29, 2021
* main:
  Fix flaky NMS test by making sure scores are unique (apache#9140)
  [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038)
  [LLVM] Make changes needed for opaque pointers (apache#9138)
  Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849)
  [CI] Split Integration tests out of first phase of pipeline (apache#9128)
  [Meta Schedule][M3b] Runner (apache#9111)
  Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141)
  [TIR] add loop partition hint pragma (apache#9121)
  fix things (apache#9146)
  [Meta Schedule][M3a] SearchStrategy (apache#9132)
  [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133)
  [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143)
  [OpenCL] Remove redundant visit statement in CodeGen. (apache#9144)
  [BYOC] support arbitrary input dims for add/mul/relu of dnnl c_src codegen (apache#9127)
  [Relay][ConvertLayout] Support for qnn.conv2d_transpose (apache#9139)
  add nn.global_avgpool to fq2i (apache#9137)
  [UnitTests] Enable minimum testing on Vulkan target in CI (apache#9093)
  [Torch] Support returning quantized weights and bias for BYOC use cases (apache#9135)
  [Relay] Prepare for new plan_devices.cc (part II) (apache#9130)
  [microTVM][Zephyr] Add MIMXRT1050 board support (apache#9068)
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
* Prepare for new plan_devices.cc (part II)

These changes came from changing apache#9038 to use
tvm.parser.fromtext instead of manual AST construction.

- Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so
  that the parser will understand them on Function definitions.
- Connect some special operators to their attributes so parsing understands them
  at call sites.
- Don't silently ignore attributes during parsing.
- Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType.
- Allow the parser to be given an initial metadata map to support examples which
  need constants.
- More DLOG -> VLOG conversions to reduce debug clutter.

* [checkpoint] Keep existing ParseModule ffi to simplify rust bindings

* [checkpoint] Address Christopher's comments.

* [checkpoint] Andrew's comments from apache#9038

* [checkpoint] Jared's comments from apache#9038

* [checkpoint] Woops, forgot rename.
AndrewZhaoLuo added a commit to AndrewZhaoLuo/tvm that referenced this pull request Sep 30, 2021
* main: (80 commits)
  Introduce centralised name transformation functions (apache#9088)
  [OpenCL] Add vectorization to cuda conv2d_nhwc schedule (apache#8636)
  [6/6] Arm(R) Ethos(TM)-U NPU codegen integration with `tvmc` (apache#8854)
  [microTVM] Add wrapper for creating project using a MLF (apache#9090)
  Fix typo (apache#9156)
  [Hotfix][Testing] Wait for RPCServer to be established (apache#9150)
  Update find cublas so it search default path if needed. (apache#9149)
  [TIR][LowerMatchBuffer] Fix lowering strides when source region has higher dimension than the buffer (apache#9145)
  Fix flaky NMS test by making sure scores are unique (apache#9140)
  [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038)
  [LLVM] Make changes needed for opaque pointers (apache#9138)
  Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849)
  [CI] Split Integration tests out of first phase of pipeline (apache#9128)
  [Meta Schedule][M3b] Runner (apache#9111)
  Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141)
  [TIR] add loop partition hint pragma (apache#9121)
  fix things (apache#9146)
  [Meta Schedule][M3a] SearchStrategy (apache#9132)
  [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133)
  [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143)
  ...
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Oct 1, 2021
device_planner.cc from apache#9038, and finally remove DeviceMap from the
LowerTE Pass.

- We retire analysis/context_analysis.cc and
  transforms/device_annotation.cc (and their tests). That
  includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
  ContextAnalysis entry points. These are all subsumed by the
  PlanDevices pass and the device aware visitors.
- The following passes now use the new 'Device Aware' visitors to
  recover the device for every Relay sub-expression:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - transforms/memory_alloc.cc (DialectRewriter)
     - backend/vm/compiler.cc (VMFunctionCompiler)
- The following passes/utils must maintain the device information
  encoded by the device planner within "on_device" annotations and
  "param_device_types"/"result_device_type" function attributes:
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/to_a_normal_form.cc (Fill)
     - ir/expr_functior.cc (Bind)
- Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
  in favor of just asking for the device. Also removed a lot of ad-doc
  encodings of the 'default' device.
- We no longer need to run device-planning twice (before and after
  lowering). Device planning is also decoupled from memory planning.
- The LowerTE Pass no longer needs an expression-to-device side table
  (which was the problem which kicked this series of PRs off in the first place).
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Oct 2, 2021
device_planner.cc from apache#9038, and finally remove DeviceMap from the
LowerTE Pass.

- We retire analysis/context_analysis.cc and
  transforms/device_annotation.cc (and their tests). That
  includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
  ContextAnalysis entry points. These are all subsumed by the
  PlanDevices pass and the device aware visitors.
- The following passes now use the new 'Device Aware' visitors to
  recover the device for every Relay sub-expression:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - transforms/memory_alloc.cc (DialectRewriter)
     - backend/vm/compiler.cc (VMFunctionCompiler)
- The following passes/utils must maintain the device information
  encoded by the device planner within "on_device" annotations and
  "param_device_types"/"result_device_type" function attributes:
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/to_a_normal_form.cc (Fill)
     - ir/expr_functior.cc (Bind)
- Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
  in favor of just asking for the device. Also removed a lot of ad-doc
  encodings of the 'default' device.
- We no longer need to run device-planning twice (before and after
  lowering). Device planning is also decoupled from memory planning.
- The LowerTE Pass no longer needs an expression-to-device side table
  (which was the problem which kicked this series of PRs off in the first place).
jroesch pushed a commit that referenced this pull request Oct 4, 2021
* [Relay] Switch the graph, VM and AOT executors to use the merged
device_planner.cc from #9038, and finally remove DeviceMap from the
LowerTE Pass.

- We retire analysis/context_analysis.cc and
  transforms/device_annotation.cc (and their tests). That
  includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
  ContextAnalysis entry points. These are all subsumed by the
  PlanDevices pass and the device aware visitors.
- The following passes now use the new 'Device Aware' visitors to
  recover the device for every Relay sub-expression:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - transforms/memory_alloc.cc (DialectRewriter)
     - backend/vm/compiler.cc (VMFunctionCompiler)
- The following passes/utils must maintain the device information
  encoded by the device planner within "on_device" annotations and
  "param_device_types"/"result_device_type" function attributes:
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/to_a_normal_form.cc (Fill)
     - ir/expr_functior.cc (Bind)
- Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
  in favor of just asking for the device. Also removed a lot of ad-doc
  encodings of the 'default' device.
- We no longer need to run device-planning twice (before and after
  lowering). Device planning is also decoupled from memory planning.
- The LowerTE Pass no longer needs an expression-to-device side table
  (which was the problem which kicked this series of PRs off in the first place).

* [checkpoint] Revert unnecessary changes

- Started down multi-target handling in interpreter but didn't finish
- Some one-off debug stuff

* [checkpoint] TODO's for default device logic
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
…tation.cc (apache#9038)

* [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game.

* [checkpoint] Use Relay script for all unit tests.

* [checkpoint] Hoist out DeviceDomain and DeviceDomains.

* [checkpoint] Hoist out visitors

* [checkpoint] Woops, left debug-only code in
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* [Relay] Switch the graph, VM and AOT executors to use the merged
device_planner.cc from apache#9038, and finally remove DeviceMap from the
LowerTE Pass.

- We retire analysis/context_analysis.cc and
  transforms/device_annotation.cc (and their tests). That
  includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
  ContextAnalysis entry points. These are all subsumed by the
  PlanDevices pass and the device aware visitors.
- The following passes now use the new 'Device Aware' visitors to
  recover the device for every Relay sub-expression:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - transforms/memory_alloc.cc (DialectRewriter)
     - backend/vm/compiler.cc (VMFunctionCompiler)
- The following passes/utils must maintain the device information
  encoded by the device planner within "on_device" annotations and
  "param_device_types"/"result_device_type" function attributes:
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/to_a_normal_form.cc (Fill)
     - ir/expr_functior.cc (Bind)
- Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
  in favor of just asking for the device. Also removed a lot of ad-doc
  encodings of the 'default' device.
- We no longer need to run device-planning twice (before and after
  lowering). Device planning is also decoupled from memory planning.
- The LowerTE Pass no longer needs an expression-to-device side table
  (which was the problem which kicked this series of PRs off in the first place).

* [checkpoint] Revert unnecessary changes

- Started down multi-target handling in interpreter but didn't finish
- Some one-off debug stuff

* [checkpoint] TODO's for default device logic
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
* Prepare for new plan_devices.cc (part II)

These changes came from changing apache#9038 to use
tvm.parser.fromtext instead of manual AST construction.

- Demote FunctionOnDeviceAttrs to just a pair of DictAttrs entries so
  that the parser will understand them on Function definitions.
- Connect some special operators to their attributes so parsing understands them
  at call sites.
- Don't silently ignore attributes during parsing.
- Implement OptFunctionOnDevice so won't add device annotations for kUnknownDeviceType.
- Allow the parser to be given an initial metadata map to support examples which
  need constants.
- More DLOG -> VLOG conversions to reduce debug clutter.

* [checkpoint] Keep existing ParseModule ffi to simplify rust bindings

* [checkpoint] Address Christopher's comments.

* [checkpoint] Andrew's comments from apache#9038

* [checkpoint] Jared's comments from apache#9038

* [checkpoint] Woops, forgot rename.
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…tation.cc (apache#9038)

* [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game.

* [checkpoint] Use Relay script for all unit tests.

* [checkpoint] Hoist out DeviceDomain and DeviceDomains.

* [checkpoint] Hoist out visitors

* [checkpoint] Woops, left debug-only code in
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
* [Relay] Switch the graph, VM and AOT executors to use the merged
device_planner.cc from apache#9038, and finally remove DeviceMap from the
LowerTE Pass.

- We retire analysis/context_analysis.cc and
  transforms/device_annotation.cc (and their tests). That
  includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
  ContextAnalysis entry points. These are all subsumed by the
  PlanDevices pass and the device aware visitors.
- The following passes now use the new 'Device Aware' visitors to
  recover the device for every Relay sub-expression:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - transforms/memory_alloc.cc (DialectRewriter)
     - backend/vm/compiler.cc (VMFunctionCompiler)
- The following passes/utils must maintain the device information
  encoded by the device planner within "on_device" annotations and
  "param_device_types"/"result_device_type" function attributes:
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/to_a_normal_form.cc (Fill)
     - ir/expr_functior.cc (Bind)
- Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
  in favor of just asking for the device. Also removed a lot of ad-doc
  encodings of the 'default' device.
- We no longer need to run device-planning twice (before and after
  lowering). Device planning is also decoupled from memory planning.
- The LowerTE Pass no longer needs an expression-to-device side table
  (which was the problem which kicked this series of PRs off in the first place).

* [checkpoint] Revert unnecessary changes

- Started down multi-target handling in interpreter but didn't finish
- Some one-off debug stuff

* [checkpoint] TODO's for default device logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants