Collage RFC #62
Thanks for the great proposal. My first comments are all more or less future possibilities.
I will try to sync up with the UMA proposal, and add a second round of comments ASAP.
@manupa-arm, it would be great to get your view on this too.
Hey all, heads up: I'm taking a close look at the 'pass ordering' problem hiding in this RFC. That is, as written and prototyped, CollageFuseOps runs just before the current FuseOps so that it can 'see' the rewrites which guide TVM's native fusion rules. However, some of those rewrites are target specific, and some are very oriented towards TVM and would likely interfere with existing BYOC patterns and custom lowering functions. So we want CollageFuseOps to run both before and after quite a few passes, and something's got to give. I'll see if I can put a clear strawman in place to address this. I have a good idea based on conversations with Matthew.
- Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc with new SubGraph::Rewrite approach
  Ready for updating RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
  *** CAUTION: Almost certainly broken ***
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- respect existing target constraints from device planner
- make 'compiler' and 'fusion_rule' attributes avail on all target kinds
- moved design to tvm-rfcs, apache/tvm-rfcs#62
- incorporate comments
- avoid repeated fusion
- fix trt type checking
- better logs
- pretty print primitive rules
- fix tensorrt
- multiple targets per spec
- don't extract candidate function until need cost
  Need to bring CombineByPrimitives back under control since lost depth limit.
- cleaned up fusion rule names
- added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants).
- starting example
- finished all the dd sections
- documentation checkpoint
- docs checkpoint
- more design
- starting on dd
- runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- need to account for build_cutlass_kernels_vm
- move cutlass tuning into relay.ext.cutlass path to avoid special case
- add utils
- don't fuse non-scalar constants for tvm target.
- stuck on cuda mem failure on conv2d, suspect bug in main
- where do the cutlass attrs come from?
- running, roughly
- pretty printing, signs of life
- wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- naive CombineByKindFusionRule, just to see what we're up against
  Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- preparing to mimic FuseOps
- rework SubGraph to use IndexSet
- rough cut at MaximalFusion
- split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
Bumped version to 0.8 based on extensive reworking due to excellent comments above -- thanks! I still need to expand the PartitionRule section, which should be ready tomorrow.
PTAL all, thanks.
Awesome RFC, Mark! It seems like all comments so far have been well addressed. If there are no additional comments in the next few days, we'll officially approve and merge this RFC.
Thanks @mbs-octoml for the detailed RFC.
I did a first pass. There is a lot to unpack here.
I think we would need to follow the template for this RFC: https://github.com/apache/tvm-rfcs/blob/main/0000-template.md
This will allow the readers to focus on UX/UI implications and technical deeper-details separately. I suppose just moving the content around to match the template is what's required here.
I have left comments and questions, mostly to understand what is being proposed.
I have a few broad questions:
1.) How can a user export the search done by Collage? I.e. something similar to loading tuning logs, where ApplyBestHistory is done.
2.) Is there a model being developed to estimate the latency of the chosen partitioning, or does it always need to run on the actual hardware? It would be great to explore whether we can abstract this at that level, so a user could plug in a Relay/TIR based performance model to avoid lengthy "tuning" time.
Having thought about this more: if there is a way to define a PartitionSpec tied to a DFS IndexSet that could be exported and imported, that might work out. I guess my question is whether there is a way to reproduce the partitioning without needing to undergo a tuning phase every time. If so, from a design perspective it would be better to decouple the Collage tuner from the Collage partitioner: one to produce an exportable partitioning spec and one to consume such a spec, respectively. Let me know if you have any thoughts around this :) @mbaret @mbs-octoml
No. As far as Collage is concerned it just calls the abstract CostEstimator::Estimate interface for each candidate partition, and can remain ignorant as to where those costs come from. In the prototype it is hard coded to tune, build and run locally to help us get going. Here at OctoML we'll need to instantiate the interface to connect to our production tuning/running/caching systems. In principle the interface could also be instantiated to some analytical model a la the cascading scheduler, though we're not planning to build one. I'm thinking the CostEstimator object can just be passed into the CollagePartitioner to leave this extension point open.
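To make the extension point described above concrete, here is a minimal Python sketch of an abstract estimator with two interchangeable backends. It is an illustration only, not the prototype's code: the class names `AnalyticalEstimator` and `CachingEstimator`, the `cheapest` helper, and the use of a plain string key to stand in for a candidate partition are all assumptions made for this example.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable


class CostEstimator(ABC):
    """Hypothetical stand-in for the abstract estimator interface."""

    @abstractmethod
    def estimate(self, candidate_key: str) -> float:
        """Return an estimated cost (in seconds) for one candidate partition."""


class AnalyticalEstimator(CostEstimator):
    """Assumed backend: costs come from a user-supplied model, not a device."""

    def __init__(self, model: Callable[[str], float]) -> None:
        self._model = model

    def estimate(self, candidate_key: str) -> float:
        return self._model(candidate_key)


class CachingEstimator(CostEstimator):
    """Wraps any estimator and remembers every cost it has already produced."""

    def __init__(self, inner: CostEstimator) -> None:
        self._inner = inner
        self._cache: Dict[str, float] = {}
        self.misses = 0  # number of times the inner estimator was consulted

    def estimate(self, candidate_key: str) -> float:
        if candidate_key not in self._cache:
            self.misses += 1
            self._cache[candidate_key] = self._inner.estimate(candidate_key)
        return self._cache[candidate_key]


def cheapest(candidates: Iterable[str], estimator: CostEstimator) -> str:
    """The search only ever calls estimate(); it never sees where costs come from."""
    return min(candidates, key=estimator.estimate)
```

Because the search depends only on `estimate`, a tuning-based backend, a production cache, or an analytical model could in principle be swapped in without touching the search itself.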
In an early draft I had support for that. Basically it just needs some materialization of the 'optimal' partitioning as an Array of CandidatePartitions. The method CandidatePartition::ParallelRewrite can take that array and rewrite the whole expression, exactly as the Collage pass would have done. So splitting between search and rewrite is pretty easy. But I ended up dropping it all in favor of just relying on the CostEstimator to cache all measurements (which it needs to do anyway given all the sharing opportunities). Firstly, it's not yet clear if there's any significant compile time advantage to bypassing the Collage search if every candidate partition to estimate results in a cache hit. I figured I'd at least measure that before adding a fix. But secondly, if someone (the service, the user) is going to go to the trouble of caching the optimal partitioning for a particular (model, targets) pair, why not just cache the built artifact directly and skip all the bother? However, let me know if I oversimplified and I can add that part back.
Ok.
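As a rough illustration of the search/rewrite split discussed here, the sketch below materializes a chosen partitioning as a list, round-trips it through JSON, and replays it without any search. Everything in it is hypothetical: the `CandidatePartition` dataclass, its `spec_name` and `node_indices` fields, the JSON layout, and the toy `parallel_rewrite` function (which only labels node indices and is not the real CandidatePartition::ParallelRewrite) are assumptions for the sake of the example.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CandidatePartition:
    """Hypothetical record: which spec claims which dataflow node indices."""
    spec_name: str
    node_indices: Tuple[int, ...]


def export_partitioning(parts: List[CandidatePartition]) -> str:
    """Serialize the chosen partitioning so it can be stored next to the model."""
    return json.dumps(
        [{"spec_name": p.spec_name, "node_indices": list(p.node_indices)} for p in parts]
    )


def import_partitioning(text: str) -> List[CandidatePartition]:
    """Rebuild the partitioning from its serialized form."""
    return [
        CandidatePartition(d["spec_name"], tuple(d["node_indices"]))
        for d in json.loads(text)
    ]


def parallel_rewrite(num_nodes: int, parts: List[CandidatePartition]) -> List[str]:
    """Toy replay: label each node with its spec; unclaimed nodes stay on 'host'."""
    assignment = ["host"] * num_nodes
    claimed = set()
    for part in parts:
        for index in part.node_indices:
            if index in claimed:
                raise ValueError(f"node {index} is claimed by more than one partition")
            claimed.add(index)
            assignment[index] = part.spec_name
    return assignment
```

A consumer that imports such a record skips the search entirely, which is the decoupling asked about above. Whether that saves meaningful compile time over a fully cached search is exactly the open question raised in the reply.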
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- actually check in 'built in' tuning log (covers mnist & gpt2 only)
- trying to debug gpt2
- Update TargetKind attribute name
- working through gpt2 issues
- check in tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- tune during search (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule iface)
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Whoops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d.
- OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- cleanup before rebase
- Use 'regular' target when building, not external codegen target
- Tuned for -libs=cudnn
- Tune before collage not during
- Bring over target changes
- Fix GetSpecName
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Whoops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32x4d.
- OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- actually check in 'built in' tuning log (covers mnist & gpt2 only)
- trying to debug gpt2
- Update TargetKind attribute name
- working through gpt2 issues
- check in tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- tune during search (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule iface)
- Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc with new SubGraph::Rewrite approach. Ready for updating RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- respect existing target constraints from device planner
- make 'compiler' and 'fusion_rule' attributes available on all target kinds
- moved design to tvm-rfcs, apache/tvm-rfcs#62
- incorporate comments
- avoid repeated fusion
- fix trt type checking
- better logs
- pretty print primitive rules
- fix tensorrt
- multiple targets per spec
- don't extract candidate function until need cost. Need to bring CombineByPrimitives back under control since lost depth limit.
- cleaned up fusion rule names
- added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants).
- starting example
- finished all the dd sections
- documentation checkpoint
- docs checkpoint
- more design
- starting on dd
- runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- need to account for build_cutlass_kernels_vm
- move cutlass tuning into relay.ext.cutlass path to avoid special case
- add utils
- don't fuse non-scalar constants for tvm target.
- stuck on cuda mem failure on conv2d, suspect bug in main
- where do the cutlass attrs come from?
- running, roughly
- pretty printing, signs of life
- wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- naive CombineByKindFusionRule, just to see what we're up against. Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- preparing to mimic FuseOps
- rework SubGraph to use IndexSet
- rough cut at MaximalFusion
- split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
- cleanup before rebase - Use 'regular' target when build, not external codegen target - Tuned for -libs=cudnn - Tune before collage not during - Bring over target changes - Fix GetSpecName - Try again on python target changes, this time leave check_and_update_host_consist unchanged - Revert python target changes to try again less agressively - Few other cleanups - Switch to 'external codegen targets' style - Woops, run just_tvm after collage to pick up tuning logs - Finish tuning for rtx3070 - Run them all! - Update tuning logs - Share global vars in the candidate function cache - Finished tuning mobilenet, started on resnet50. - Include model name in logs to make sure we don't get anything mixed up - Drop -arch=sm_80 - Fix MaxCoalesce - Attach external_symbol to lifted functions - Add missing node registration, but leave VisitAttrs empty for now - Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing. - Finish tuning resnext50 - Improve coelescing - Account for coelesced functions when outlining final module - Fix caching, for real this time. - More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d. - OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied. - Use fp16 in TensorRT only if model's 'main_dtype' is float16. - Fix CostEstimator caching issue - More Target cleanup (while waiting for tuning runs) - Better logging of candidates - Support export to ONNX - Fix merge - Part-way through tuning for mobilenet. - Add resnext50_32x4d - Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them - Still trying - Trying to track down weird failure in conv2d compute. 
- Switch tensorrt to be fully pattern & composite function based - Combiner rule for tuple projection - Allow build to fail in estimate_seconds - Add mobilenetv2 and resnet50v2 to menagerie - Update CompilationConfig to handle target refinement - Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side) - Save/Load dso libraries (needed for Cutlass with separated run) - Move models into separate file - gpt2_extract_16 and autotvm tuning log - Handle missing tuning log files - fp16 support in scalars and the tensorrt runtime. - Wrap runner in nsys nvprof if requested - Enforce strict compile/run time separation in preparation for profiling - Better logging of final optimal partitioning and state of all candidates - Fix handling of tuples and InlineComposites fixup pass. - Fix TensorRT pattern bugs - Pass max_max_depth via PassContext - Better logging so can quickly compare specs - BUG: Benchmark the partitioned rather than original model!!! - Use median instead of mean - Back to GPT2 - Make sure all function vars have a type - Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel). - Make sure cudnn pattern table is registered - Enable cudnn, get rid of support for op-predicate based BYOC integrations - Enable cublas - And yet another go at pruning unnecessary candidates. 
- Another go at pruning unnecessary candidates - Fix CompositePartitionRule use - Fix a few bugs with new TensorRT pattern-based integration - Rework RemoveSubCandidatesCombinerRule for soundness - Better logging - Bug fixes - Implement critical nodes idea for avoiding obviously unnecessary candidates - Promote DataflowGraph from alias to class so can cache downstream index set - Quick check to avoid unioning candidates which would create a cycle - Host out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates - GetFunction can legitimately return nullptr - rename tuning log - Support for int64 literals - Switch GPT2 to plain model - Fix library cloberring issue for cutlass - actually checkin 'built in' tuning log (covers mnist & gpt2 only) - trying to debug gpt2 - Update TargetKind attribute name - working through gpt2 issues - checkin tuning records for MNIST (with hack to not retry failed winograd) - Autotvm tuning disabled if log file empty (default) - Autotvm tuning during search working - tune during search (but does not load tuned records after search!) - About to add tuning to estimate_seconds - Split out the combiner rules & make them FFI friendly - Rework comments - Estimate IRModule instead of Function (closer to meta_schedule iface) - Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case) - Move CollagePartitioner to very start of VM compiler flow (not changing legacy) - Fix bugs etc with new SubGraph::Rewrite approach Ready for updating RFC to focus on partitioning instead of fusion. - Working again after partition<->fusion split. 
- Add PrimitivePartitionRule - Refactor SubGraph Extract/Rewrite - Rename kernel->partition, fusion->partition - Next: make nesting in "Primitive" an explicit transform - respect existing target constraints from device planner - make 'compiler' and 'fusion_rule' attributes avail on all target kinds - moved design to tvm-rfcs, apache/tvm-rfcs#62 - incorporate comments - avoid repeated fusion - fix trt type checking - better logs - pretty print primitive rules - fix tensorrt - multiple targets per spec - don't extract candidate function until need cost Need to bring CombineByPrimitives back under control since lost depth limit. - cleaned up fusion rule names - added 'fuse anything touching' for BYOC - Finish dd example - Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants). - starting example - finished all the dd sections - documentation checkpoint - docs checkpoint - more design - starting on dd - runs MNIST with TVM+CUTLASS+TRT - cutlass function-at-a-time build - need to account for build_cutlass_kernels_vm - move cutlass tuning into relay.ext.cutlass path to avoid special case - add utils - don't fuse non-scalar constants for tvm target. - stuck on cuda mem failure on conv2d, suspect bug in main - where do the cutlass attrs come from? - running, roughtly - pretty printing, signs of life - wire things up again - Switch SubGraph and CandidateKernel to TVM objects - naive CombineByKindFusionRule, just to see what we're up agaist Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying. - preparing to mimic FuseOps - rework SubGraph to use IndexSet - rough cut at MaximalFusion - split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph. 
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- cleanup before rebase - Use 'regular' target when build, not external codegen target - Tuned for -libs=cudnn - Tune before collage not during - Bring over target changes - Fix GetSpecName - Try again on python target changes, this time leave check_and_update_host_consist unchanged - Revert python target changes to try again less agressively - Few other cleanups - Switch to 'external codegen targets' style - Woops, run just_tvm after collage to pick up tuning logs - Finish tuning for rtx3070 - Run them all! - Update tuning logs - Share global vars in the candidate function cache - Finished tuning mobilenet, started on resnet50. - Include model name in logs to make sure we don't get anything mixed up - Drop -arch=sm_80 - Fix MaxCoalesce - Attach external_symbol to lifted functions - Add missing node registration, but leave VisitAttrs empty for now - Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing. - Finish tuning resnext50 - Improve coelescing - Account for coelesced functions when outlining final module - Fix caching, for real this time. - More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d. - OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied. - Use fp16 in TensorRT only if model's 'main_dtype' is float16. - Fix CostEstimator caching issue - More Target cleanup (while waiting for tuning runs) - Better logging of candidates - Support export to ONNX - Fix merge - Part-way through tuning for mobilenet. - Add resnext50_32x4d - Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them - Still trying - Trying to track down weird failure in conv2d compute. 
- Switch tensorrt to be fully pattern & composite function based - Combiner rule for tuple projection - Allow build to fail in estimate_seconds - Add mobilenetv2 and resnet50v2 to menagerie - Update CompilationConfig to handle target refinement - Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side) - Save/Load dso libraries (needed for Cutlass with separated run) - Move models into separate file - gpt2_extract_16 and autotvm tuning log - Handle missing tuning log files - fp16 support in scalars and the tensorrt runtime. - Wrap runner in nsys nvprof if requested - Enforce strict compile/run time separation in preparation for profiling - Better logging of final optimal partitioning and state of all candidates - Fix handling of tuples and InlineComposites fixup pass. - Fix TensorRT pattern bugs - Pass max_max_depth via PassContext - Better logging so can quickly compare specs - BUG: Benchmark the partitioned rather than original model!!! - Use median instead of mean - Back to GPT2 - Make sure all function vars have a type - Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel). - Make sure cudnn pattern table is registered - Enable cudnn, get rid of support for op-predicate based BYOC integrations - Enable cublas - And yet another go at pruning unnecessary candidates. 
- Another go at pruning unnecessary candidates - Fix CompositePartitionRule use - Fix a few bugs with new TensorRT pattern-based integration - Rework RemoveSubCandidatesCombinerRule for soundness - Better logging - Bug fixes - Implement critical nodes idea for avoiding obviously unnecessary candidates - Promote DataflowGraph from alias to class so can cache downstream index set - Quick check to avoid unioning candidates which would create a cycle - Host out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates - GetFunction can legitimately return nullptr - rename tuning log - Support for int64 literals - Switch GPT2 to plain model - Fix library cloberring issue for cutlass - actually checkin 'built in' tuning log (covers mnist & gpt2 only) - trying to debug gpt2 - Update TargetKind attribute name - working through gpt2 issues - checkin tuning records for MNIST (with hack to not retry failed winograd) - Autotvm tuning disabled if log file empty (default) - Autotvm tuning during search working - tune during search (but does not load tuned records after search!) - About to add tuning to estimate_seconds - Split out the combiner rules & make them FFI friendly - Rework comments - Estimate IRModule instead of Function (closer to meta_schedule iface) - Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case) - Move CollagePartitioner to very start of VM compiler flow (not changing legacy) - Fix bugs etc with new SubGraph::Rewrite approach Ready for updating RFC to focus on partitioning instead of fusion. - Working again after partition<->fusion split. 
- Add PrimitivePartitionRule - Refactor SubGraph Extract/Rewrite - Rename kernel->partition, fusion->partition - Next: make nesting in "Primitive" an explicit transform - respect existing target constraints from device planner - make 'compiler' and 'fusion_rule' attributes avail on all target kinds - moved design to tvm-rfcs, apache/tvm-rfcs#62 - incorporate comments - avoid repeated fusion - fix trt type checking - better logs - pretty print primitive rules - fix tensorrt - multiple targets per spec - don't extract candidate function until need cost Need to bring CombineByPrimitives back under control since lost depth limit. - cleaned up fusion rule names - added 'fuse anything touching' for BYOC - Finish dd example - Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants). - starting example - finished all the dd sections - documentation checkpoint - docs checkpoint - more design - starting on dd - runs MNIST with TVM+CUTLASS+TRT - cutlass function-at-a-time build - need to account for build_cutlass_kernels_vm - move cutlass tuning into relay.ext.cutlass path to avoid special case - add utils - don't fuse non-scalar constants for tvm target. - stuck on cuda mem failure on conv2d, suspect bug in main - where do the cutlass attrs come from? - running, roughtly - pretty printing, signs of life - wire things up again - Switch SubGraph and CandidateKernel to TVM objects - naive CombineByKindFusionRule, just to see what we're up agaist Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying. - preparing to mimic FuseOps - rework SubGraph to use IndexSet - rough cut at MaximalFusion - split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph. 
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- cleanup before rebase - Use 'regular' target when build, not external codegen target - Tuned for -libs=cudnn - Tune before collage not during - Bring over target changes - Fix GetSpecName - Try again on python target changes, this time leave check_and_update_host_consist unchanged - Revert python target changes to try again less agressively - Few other cleanups - Switch to 'external codegen targets' style - Woops, run just_tvm after collage to pick up tuning logs - Finish tuning for rtx3070 - Run them all! - Update tuning logs - Share global vars in the candidate function cache - Finished tuning mobilenet, started on resnet50. - Include model name in logs to make sure we don't get anything mixed up - Drop -arch=sm_80 - Fix MaxCoalesce - Attach external_symbol to lifted functions - Add missing node registration, but leave VisitAttrs empty for now - Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing. - Finish tuning resnext50 - Improve coelescing - Account for coelesced functions when outlining final module - Fix caching, for real this time. - More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d. - OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied. - Use fp16 in TensorRT only if model's 'main_dtype' is float16. - Fix CostEstimator caching issue - More Target cleanup (while waiting for tuning runs) - Better logging of candidates - Support export to ONNX - Fix merge - Part-way through tuning for mobilenet. - Add resnext50_32x4d - Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them - Still trying - Trying to track down weird failure in conv2d compute. 
- Switch tensorrt to be fully pattern & composite function based - Combiner rule for tuple projection - Allow build to fail in estimate_seconds - Add mobilenetv2 and resnet50v2 to menagerie - Update CompilationConfig to handle target refinement - Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side) - Save/Load dso libraries (needed for Cutlass with separated run) - Move models into separate file - gpt2_extract_16 and autotvm tuning log - Handle missing tuning log files - fp16 support in scalars and the tensorrt runtime. - Wrap runner in nsys nvprof if requested - Enforce strict compile/run time separation in preparation for profiling - Better logging of final optimal partitioning and state of all candidates - Fix handling of tuples and InlineComposites fixup pass. - Fix TensorRT pattern bugs - Pass max_max_depth via PassContext - Better logging so can quickly compare specs - BUG: Benchmark the partitioned rather than original model!!! - Use median instead of mean - Back to GPT2 - Make sure all function vars have a type - Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel). - Make sure cudnn pattern table is registered - Enable cudnn, get rid of support for op-predicate based BYOC integrations - Enable cublas - And yet another go at pruning unnecessary candidates. 
- Another go at pruning unnecessary candidates - Fix CompositePartitionRule use - Fix a few bugs with new TensorRT pattern-based integration - Rework RemoveSubCandidatesCombinerRule for soundness - Better logging - Bug fixes - Implement critical nodes idea for avoiding obviously unnecessary candidates - Promote DataflowGraph from alias to class so can cache downstream index set - Quick check to avoid unioning candidates which would create a cycle - Host out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates - GetFunction can legitimately return nullptr - rename tuning log - Support for int64 literals - Switch GPT2 to plain model - Fix library cloberring issue for cutlass - actually checkin 'built in' tuning log (covers mnist & gpt2 only) - trying to debug gpt2 - Update TargetKind attribute name - working through gpt2 issues - checkin tuning records for MNIST (with hack to not retry failed winograd) - Autotvm tuning disabled if log file empty (default) - Autotvm tuning during search working - tune during search (but does not load tuned records after search!) - About to add tuning to estimate_seconds - Split out the combiner rules & make them FFI friendly - Rework comments - Estimate IRModule instead of Function (closer to meta_schedule iface) - Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case) - Move CollagePartitioner to very start of VM compiler flow (not changing legacy) - Fix bugs etc with new SubGraph::Rewrite approach Ready for updating RFC to focus on partitioning instead of fusion. - Working again after partition<->fusion split. 
- Add PrimitivePartitionRule - Refactor SubGraph Extract/Rewrite - Rename kernel->partition, fusion->partition - Next: make nesting in "Primitive" an explicit transform - respect existing target constraints from device planner - make 'compiler' and 'fusion_rule' attributes avail on all target kinds - moved design to tvm-rfcs, apache/tvm-rfcs#62 - incorporate comments - avoid repeated fusion - fix trt type checking - better logs - pretty print primitive rules - fix tensorrt - multiple targets per spec - don't extract candidate function until need cost Need to bring CombineByPrimitives back under control since lost depth limit. - cleaned up fusion rule names - added 'fuse anything touching' for BYOC - Finish dd example - Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants). - starting example - finished all the dd sections - documentation checkpoint - docs checkpoint - more design - starting on dd - runs MNIST with TVM+CUTLASS+TRT - cutlass function-at-a-time build - need to account for build_cutlass_kernels_vm - move cutlass tuning into relay.ext.cutlass path to avoid special case - add utils - don't fuse non-scalar constants for tvm target. - stuck on cuda mem failure on conv2d, suspect bug in main - where do the cutlass attrs come from? - running, roughtly - pretty printing, signs of life - wire things up again - Switch SubGraph and CandidateKernel to TVM objects - naive CombineByKindFusionRule, just to see what we're up agaist Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying. - preparing to mimic FuseOps - rework SubGraph to use IndexSet - rough cut at MaximalFusion - split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph. 
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
- cleanup before rebase
- Use 'regular' target when build, not external codegen target
- Tuned for -libs=cudnn
- Tune before collage not during
- Bring over target changes
- Fix GetSpecName
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Whoops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d.
- OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Host glitch if PlanDevices run before CollagePartition - Fix unit test - Make load_static_library first class python func - Get CUTLASS going on graph executor as well as vm - Include export_library in estimate_seconds - Rollback DSOLibrary changes. - Add StaticLibraryNode and switch CUTLASS to use it This avoids the crazy serialize/deserialize/load hackery, which I'll now remove. - Get running again - CUTLASS picks up all options from 'cutlass' external codegen target. - Revert false starts with cutlass handling - Get CUTLASS going with program-at-a-time tuning and compilation instead of function at a time. - Save DSOLibraries by contents rather than by reference. - futzing with libraries - revert unnecessary cutlass changes - starting unit test for dsolibrary save - Prepare scalar changes for PR. - Eager candidate cost measurement. - More conv2d_cudnn.cuda training records. - cleanup before rebase - Use 'regular' target when build, not external codegen target - Tuned for -libs=cudnn - Tune before collage not during - Bring over target changes - Fix GetSpecName - Try again on python target changes, this time leave check_and_update_host_consist unchanged - Revert python target changes to try again less agressively - Few other cleanups - Switch to 'external codegen targets' style - Woops, run just_tvm after collage to pick up tuning logs - Finish tuning for rtx3070 - Run them all! - Update tuning logs - Share global vars in the candidate function cache - Finished tuning mobilenet, started on resnet50. - Include model name in logs to make sure we don't get anything mixed up - Drop -arch=sm_80 - Fix MaxCoalesce - Attach external_symbol to lifted functions - Add missing node registration, but leave VisitAttrs empty for now - Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing. - Finish tuning resnext50 - Improve coelescing - Account for coelesced functions when outlining final module - Fix caching, for real this time. 
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d. - OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied. - Use fp16 in TensorRT only if model's 'main_dtype' is float16. - Fix CostEstimator caching issue - More Target cleanup (while waiting for tuning runs) - Better logging of candidates - Support export to ONNX - Fix merge - Part-way through tuning for mobilenet. - Add resnext50_32x4d - Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them - Still trying - Trying to track down weird failure in conv2d compute. - Switch tensorrt to be fully pattern & composite function based - Combiner rule for tuple projection - Allow build to fail in estimate_seconds - Add mobilenetv2 and resnet50v2 to menagerie - Update CompilationConfig to handle target refinement - Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side) - Save/Load dso libraries (needed for Cutlass with separated run) - Move models into separate file - gpt2_extract_16 and autotvm tuning log - Handle missing tuning log files - fp16 support in scalars and the tensorrt runtime. - Wrap runner in nsys nvprof if requested - Enforce strict compile/run time separation in preparation for profiling - Better logging of final optimal partitioning and state of all candidates - Fix handling of tuples and InlineComposites fixup pass. - Fix TensorRT pattern bugs - Pass max_max_depth via PassContext - Better logging so can quickly compare specs - BUG: Benchmark the partitioned rather than original model!!! - Use median instead of mean - Back to GPT2 - Make sure all function vars have a type - Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel). 
- Make sure cudnn pattern table is registered - Enable cudnn, get rid of support for op-predicate based BYOC integrations - Enable cublas - And yet another go at pruning unnecessary candidates. - Another go at pruning unnecessary candidates - Fix CompositePartitionRule use - Fix a few bugs with new TensorRT pattern-based integration - Rework RemoveSubCandidatesCombinerRule for soundness - Better logging - Bug fixes - Implement critical nodes idea for avoiding obviously unnecessary candidates - Promote DataflowGraph from alias to class so can cache downstream index set - Quick check to avoid unioning candidates which would create a cycle - Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates - GetFunction can legitimately return nullptr - rename tuning log - Support for int64 literals - Switch GPT2 to plain model - Fix library clobbering issue for cutlass - actually check in 'built in' tuning log (covers mnist & gpt2 only) - trying to debug gpt2 - Update TargetKind attribute name - working through gpt2 issues - check in tuning records for MNIST (with hack to not retry failed winograd) - Autotvm tuning disabled if log file empty (default) - Autotvm tuning during search working - tune during search (but does not load tuned records after search!) - About to add tuning to estimate_seconds - Split out the combiner rules & make them FFI friendly - Rework comments - Estimate IRModule instead of Function (closer to meta_schedule iface) - Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case) - Move CollagePartitioner to very start of VM compiler flow (not changing legacy) - Fix bugs etc with new SubGraph::Rewrite approach. Ready for updating RFC to focus on partitioning instead of fusion. - Working again after partition<->fusion split.
- Add PrimitivePartitionRule - Refactor SubGraph Extract/Rewrite - Rename kernel->partition, fusion->partition - Next: make nesting in "Primitive" an explicit transform - respect existing target constraints from device planner - make 'compiler' and 'fusion_rule' attributes available on all target kinds - moved design to tvm-rfcs, apache/tvm-rfcs#62 - incorporate comments - avoid repeated fusion - fix trt type checking - better logs - pretty print primitive rules - fix tensorrt - multiple targets per spec - don't extract candidate function until need cost. Need to bring CombineByPrimitives back under control since lost depth limit. - cleaned up fusion rule names - added 'fuse anything touching' for BYOC - Finish dd example - Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants). - starting example - finished all the dd sections - documentation checkpoint - docs checkpoint - more design - starting on dd - runs MNIST with TVM+CUTLASS+TRT - cutlass function-at-a-time build - need to account for build_cutlass_kernels_vm - move cutlass tuning into relay.ext.cutlass path to avoid special case - add utils - don't fuse non-scalar constants for tvm target. - stuck on cuda mem failure on conv2d, suspect bug in main - where do the cutlass attrs come from? - running, roughly - pretty printing, signs of life - wire things up again - Switch SubGraph and CandidateKernel to TVM objects - naive CombineByKindFusionRule, just to see what we're up against. Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying. - preparing to mimic FuseOps - rework SubGraph to use IndexSet - rough cut at MaximalFusion - split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- Polish compiler_function_utils for splitting out - Mark functions as extern. - Get rid of relay.ext.cutlass - kExternalSymbol:String ----> kExtern:Bool
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- Fix rebase
- Prepare for rebase
- Move CaptureIndexInSpans to generic tvm.relay.transform
- Fix test_sub_graph.py unit tests
- Make PartitionSpecs 1:1 with Targets
- Fix tests
- Finish merging Matthew's changes
- First pass merging Matthew's changes
- finish fixing lints
- test_tensorrt.py runs
- some lint fixes while waiting
- test annotation fiddles, disable pytorch test
- fix constant handling
- update tests for new API
- Switch TensorRT BYOC integration to IRModule-at-a-time
- [bug] index out of range
- don't need InferTypeExpr
- revert unnecessary changes
- revert unnecessary changes
- fix accumulate bug
- sync with 11481
- Eta-expand tuple args in candidate partitions (so measurement does not need to worry about constructing tuple arguments)
- Polish compiler_function_utils for splitting out
- Mark functions as extern.
- Get rid of relay.ext.cutlass
- kExternalSymbol:String ----> kExtern:Bool
- Host glitch if PlanDevices run before CollagePartition
- Fix unit test
- Make load_static_library first class python func
- Get CUTLASS going on graph executor as well as vm
- Include export_library in estimate_seconds
- Rollback DSOLibrary changes.
- Add StaticLibraryNode and switch CUTLASS to use it. This avoids the crazy serialize/deserialize/load hackery, which I'll now remove.
- Get running again
- CUTLASS picks up all options from 'cutlass' external codegen target.
- Revert false starts with cutlass handling
- Get CUTLASS going with program-at-a-time tuning and compilation instead of function at a time.
- Save DSOLibraries by contents rather than by reference.
- futzing with libraries
- revert unnecessary cutlass changes
- starting unit test for dsolibrary save
- Prepare scalar changes for PR.
- Eager candidate cost measurement.
- More conv2d_cudnn.cuda training records.
- cleanup before rebase
- Use 'regular' target when build, not external codegen target
- Tuned for -libs=cudnn
- Tune before collage not during
- Bring over target changes
- Fix GetSpecName
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Whoops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d.
- OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Host out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- actually checkin 'built in' tuning log (covers mnist & gpt2 only)
- trying to debug gpt2
- Update TargetKind attribute name
- working through gpt2 issues
- checkin tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- tune during search (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule iface)
- Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc with new SubGraph::Rewrite approach. Ready for updating RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- respect existing target constraints from device planner
- make 'compiler' and 'fusion_rule' attributes avail on all target kinds
- moved design to tvm-rfcs, apache/tvm-rfcs#62
- incorporate comments
- avoid repeated fusion
- fix trt type checking
- better logs
- pretty print primitive rules
- fix tensorrt
- multiple targets per spec
- don't extract candidate function until need cost. Need to bring CombineByPrimitives back under control since lost depth limit.
- cleaned up fusion rule names
- added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants).
- starting example
- finished all the dd sections
- documentation checkpoint
- docs checkpoint
- more design
- starting on dd
- runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- need to account for build_cutlass_kernels_vm
- move cutlass tuning into relay.ext.cutlass path to avoid special case
- add utils
- don't fuse non-scalar constants for tvm target.
- stuck on cuda mem failure on conv2d, suspect bug in main
- where do the cutlass attrs come from?
- running, roughly
- pretty printing, signs of life
- wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- naive CombineByKindFusionRule, just to see what we're up against. Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- preparing to mimic FuseOps
- rework SubGraph to use IndexSet
- rough cut at MaximalFusion
- split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
- Fix rebase - Prepare for rebase - Move CaptureIndexInSpans to generic tvm.relay.transform - Fix test_sub_graph.py unit tests - Make PartitionSpecs 1:1 with Targets - Fix tests - Finish merging Matthew's changes - First pass merging Matthew's changes - finish fixing lints - test_tensorrt.py runs - some lint fixes while waiting - test annotation fiddles, disable pytorch test - fix constant handling - update tests for new API - Switch TensorRT BYOC integration to IRModule-at-a-time - [bug] index out of range - don't need InferTypeExpr - revert unnecessary changes - revert unnecessary changes - fix accumulate bug - sync with 11481 - Eta-expand tuple ars in candidate partitions (so measurements does not need to worry about constructing tuple arguments) - Polish compiler_function_utils for splitting out - Mark functions as extern. - Get rid of relay.ext.cutlass - kExternalSymbol:String ----> kExtern:Bool - Host glitch if PlanDevices run before CollagePartition - Fix unit test - Make load_static_library first class python func - Get CUTLASS going on graph executor as well as vm - Include export_library in estimate_seconds - Rollback DSOLibrary changes. - Add StaticLibraryNode and switch CUTLASS to use it This avoids the crazy serialize/deserialize/load hackery, which I'll now remove. - Get running again - CUTLASS picks up all options from 'cutlass' external codegen target. - Revert false starts with cutlass handling - Get CUTLASS going with program-at-a-time tuning and compilation instead of function at a time. - Save DSOLibraries by contents rather than by reference. - futzing with libraries - revert unnecessary cutlass changes - starting unit test for dsolibrary save - Prepare scalar changes for PR. - Eager candidate cost measurement. - More conv2d_cudnn.cuda training records. 
- cleanup before rebase - Use 'regular' target when build, not external codegen target - Tuned for -libs=cudnn - Tune before collage not during - Bring over target changes - Fix GetSpecName - Try again on python target changes, this time leave check_and_update_host_consist unchanged - Revert python target changes to try again less agressively - Few other cleanups - Switch to 'external codegen targets' style - Woops, run just_tvm after collage to pick up tuning logs - Finish tuning for rtx3070 - Run them all! - Update tuning logs - Share global vars in the candidate function cache - Finished tuning mobilenet, started on resnet50. - Include model name in logs to make sure we don't get anything mixed up - Drop -arch=sm_80 - Fix MaxCoalesce - Attach external_symbol to lifted functions - Add missing node registration, but leave VisitAttrs empty for now - Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing. - Finish tuning resnext50 - Improve coelescing - Account for coelesced functions when outlining final module - Fix caching, for real this time. - More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d. - OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied. - Use fp16 in TensorRT only if model's 'main_dtype' is float16. - Fix CostEstimator caching issue - More Target cleanup (while waiting for tuning runs) - Better logging of candidates - Support export to ONNX - Fix merge - Part-way through tuning for mobilenet. - Add resnext50_32x4d - Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them - Still trying - Trying to track down weird failure in conv2d compute. 
- Switch tensorrt to be fully pattern & composite function based - Combiner rule for tuple projection - Allow build to fail in estimate_seconds - Add mobilenetv2 and resnet50v2 to menagerie - Update CompilationConfig to handle target refinement - Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side) - Save/Load dso libraries (needed for Cutlass with separated run) - Move models into separate file - gpt2_extract_16 and autotvm tuning log - Handle missing tuning log files - fp16 support in scalars and the tensorrt runtime. - Wrap runner in nsys nvprof if requested - Enforce strict compile/run time separation in preparation for profiling - Better logging of final optimal partitioning and state of all candidates - Fix handling of tuples and InlineComposites fixup pass. - Fix TensorRT pattern bugs - Pass max_max_depth via PassContext - Better logging so can quickly compare specs - BUG: Benchmark the partitioned rather than original model!!! - Use median instead of mean - Back to GPT2 - Make sure all function vars have a type - Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel). - Make sure cudnn pattern table is registered - Enable cudnn, get rid of support for op-predicate based BYOC integrations - Enable cublas - And yet another go at pruning unnecessary candidates. 
- Another go at pruning unnecessary candidates - Fix CompositePartitionRule use - Fix a few bugs with new TensorRT pattern-based integration - Rework RemoveSubCandidatesCombinerRule for soundness - Better logging - Bug fixes - Implement critical nodes idea for avoiding obviously unnecessary candidates - Promote DataflowGraph from alias to class so can cache downstream index set - Quick check to avoid unioning candidates which would create a cycle - Host out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates - GetFunction can legitimately return nullptr - rename tuning log - Support for int64 literals - Switch GPT2 to plain model - Fix library cloberring issue for cutlass - actually checkin 'built in' tuning log (covers mnist & gpt2 only) - trying to debug gpt2 - Update TargetKind attribute name - working through gpt2 issues - checkin tuning records for MNIST (with hack to not retry failed winograd) - Autotvm tuning disabled if log file empty (default) - Autotvm tuning during search working - tune during search (but does not load tuned records after search!) - About to add tuning to estimate_seconds - Split out the combiner rules & make them FFI friendly - Rework comments - Estimate IRModule instead of Function (closer to meta_schedule iface) - Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case) - Move CollagePartitioner to very start of VM compiler flow (not changing legacy) - Fix bugs etc with new SubGraph::Rewrite approach Ready for updating RFC to focus on partitioning instead of fusion. - Working again after partition<->fusion split. 
- Add PrimitivePartitionRule - Refactor SubGraph Extract/Rewrite - Rename kernel->partition, fusion->partition - Next: make nesting in "Primitive" an explicit transform - respect existing target constraints from device planner - make 'compiler' and 'fusion_rule' attributes avail on all target kinds - moved design to tvm-rfcs, apache/tvm-rfcs#62 - incorporate comments - avoid repeated fusion - fix trt type checking - better logs - pretty print primitive rules - fix tensorrt - multiple targets per spec - don't extract candidate function until need cost Need to bring CombineByPrimitives back under control since lost depth limit. - cleaned up fusion rule names - added 'fuse anything touching' for BYOC - Finish dd example - Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants). - starting example - finished all the dd sections - documentation checkpoint - docs checkpoint - more design - starting on dd - runs MNIST with TVM+CUTLASS+TRT - cutlass function-at-a-time build - need to account for build_cutlass_kernels_vm - move cutlass tuning into relay.ext.cutlass path to avoid special case - add utils - don't fuse non-scalar constants for tvm target. - stuck on cuda mem failure on conv2d, suspect bug in main - where do the cutlass attrs come from? - running, roughtly - pretty printing, signs of life - wire things up again - Switch SubGraph and CandidateKernel to TVM objects - naive CombineByKindFusionRule, just to see what we're up agaist Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying. - preparing to mimic FuseOps - rework SubGraph to use IndexSet - rough cut at MaximalFusion - split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph. 
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- Fix rebase - Prepare for rebase - Move CaptureIndexInSpans to generic tvm.relay.transform - Fix test_sub_graph.py unit tests - Make PartitionSpecs 1:1 with Targets - Fix tests - Finish merging Matthew's changes - First pass merging Matthew's changes - finish fixing lints - test_tensorrt.py runs - some lint fixes while waiting - test annotation fiddles, disable pytorch test - fix constant handling - update tests for new API - Switch TensorRT BYOC integration to IRModule-at-a-time - [bug] index out of range - don't need InferTypeExpr - revert unnecessary changes - revert unnecessary changes - fix accumulate bug - sync with 11481 - Eta-expand tuple ars in candidate partitions (so measurements does not need to worry about constructing tuple arguments) - Polish compiler_function_utils for splitting out - Mark functions as extern. - Get rid of relay.ext.cutlass - kExternalSymbol:String ----> kExtern:Bool - Host glitch if PlanDevices run before CollagePartition - Fix unit test - Make load_static_library first class python func - Get CUTLASS going on graph executor as well as vm - Include export_library in estimate_seconds - Rollback DSOLibrary changes. - Add StaticLibraryNode and switch CUTLASS to use it This avoids the crazy serialize/deserialize/load hackery, which I'll now remove. - Get running again - CUTLASS picks up all options from 'cutlass' external codegen target. - Revert false starts with cutlass handling - Get CUTLASS going with program-at-a-time tuning and compilation instead of function at a time. - Save DSOLibraries by contents rather than by reference. - futzing with libraries - revert unnecessary cutlass changes - starting unit test for dsolibrary save - Prepare scalar changes for PR. - Eager candidate cost measurement. - More conv2d_cudnn.cuda training records. 
- cleanup before rebase - Use 'regular' target when build, not external codegen target - Tuned for -libs=cudnn - Tune before collage not during - Bring over target changes - Fix GetSpecName - Try again on python target changes, this time leave check_and_update_host_consist unchanged - Revert python target changes to try again less agressively - Few other cleanups - Switch to 'external codegen targets' style - Woops, run just_tvm after collage to pick up tuning logs - Finish tuning for rtx3070 - Run them all! - Update tuning logs - Share global vars in the candidate function cache - Finished tuning mobilenet, started on resnet50. - Include model name in logs to make sure we don't get anything mixed up - Drop -arch=sm_80 - Fix MaxCoalesce - Attach external_symbol to lifted functions - Add missing node registration, but leave VisitAttrs empty for now - Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing. - Finish tuning resnext50 - Improve coelescing - Account for coelesced functions when outlining final module - Fix caching, for real this time. - More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d. - OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied. - Use fp16 in TensorRT only if model's 'main_dtype' is float16. - Fix CostEstimator caching issue - More Target cleanup (while waiting for tuning runs) - Better logging of candidates - Support export to ONNX - Fix merge - Part-way through tuning for mobilenet. - Add resnext50_32x4d - Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them - Still trying - Trying to track down weird failure in conv2d compute. 
- Switch tensorrt to be fully pattern & composite function based - Combiner rule for tuple projection - Allow build to fail in estimate_seconds - Add mobilenetv2 and resnet50v2 to menagerie - Update CompilationConfig to handle target refinement - Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side) - Save/Load dso libraries (needed for Cutlass with separated run) - Move models into separate file - gpt2_extract_16 and autotvm tuning log - Handle missing tuning log files - fp16 support in scalars and the tensorrt runtime. - Wrap runner in nsys nvprof if requested - Enforce strict compile/run time separation in preparation for profiling - Better logging of final optimal partitioning and state of all candidates - Fix handling of tuples and InlineComposites fixup pass. - Fix TensorRT pattern bugs - Pass max_max_depth via PassContext - Better logging so can quickly compare specs - BUG: Benchmark the partitioned rather than original model!!! - Use median instead of mean - Back to GPT2 - Make sure all function vars have a type - Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel). - Make sure cudnn pattern table is registered - Enable cudnn, get rid of support for op-predicate based BYOC integrations - Enable cublas - And yet another go at pruning unnecessary candidates. 
- Another go at pruning unnecessary candidates - Fix CompositePartitionRule use - Fix a few bugs with new TensorRT pattern-based integration - Rework RemoveSubCandidatesCombinerRule for soundness - Better logging - Bug fixes - Implement critical nodes idea for avoiding obviously unnecessary candidates - Promote DataflowGraph from alias to class so can cache downstream index set - Quick check to avoid unioning candidates which would create a cycle - Host out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates - GetFunction can legitimately return nullptr - rename tuning log - Support for int64 literals - Switch GPT2 to plain model - Fix library cloberring issue for cutlass - actually checkin 'built in' tuning log (covers mnist & gpt2 only) - trying to debug gpt2 - Update TargetKind attribute name - working through gpt2 issues - checkin tuning records for MNIST (with hack to not retry failed winograd) - Autotvm tuning disabled if log file empty (default) - Autotvm tuning during search working - tune during search (but does not load tuned records after search!) - About to add tuning to estimate_seconds - Split out the combiner rules & make them FFI friendly - Rework comments - Estimate IRModule instead of Function (closer to meta_schedule iface) - Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case) - Move CollagePartitioner to very start of VM compiler flow (not changing legacy) - Fix bugs etc with new SubGraph::Rewrite approach Ready for updating RFC to focus on partitioning instead of fusion. - Working again after partition<->fusion split. 
- Add PrimitivePartitionRule - Refactor SubGraph Extract/Rewrite - Rename kernel->partition, fusion->partition - Next: make nesting in "Primitive" an explicit transform - respect existing target constraints from device planner - make 'compiler' and 'fusion_rule' attributes avail on all target kinds - moved design to tvm-rfcs, apache/tvm-rfcs#62 - incorporate comments - avoid repeated fusion - fix trt type checking - better logs - pretty print primitive rules - fix tensorrt - multiple targets per spec - don't extract candidate function until need cost Need to bring CombineByPrimitives back under control since lost depth limit. - cleaned up fusion rule names - added 'fuse anything touching' for BYOC - Finish dd example - Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants). - starting example - finished all the dd sections - documentation checkpoint - docs checkpoint - more design - starting on dd - runs MNIST with TVM+CUTLASS+TRT - cutlass function-at-a-time build - need to account for build_cutlass_kernels_vm - move cutlass tuning into relay.ext.cutlass path to avoid special case - add utils - don't fuse non-scalar constants for tvm target. - stuck on cuda mem failure on conv2d, suspect bug in main - where do the cutlass attrs come from? - running, roughtly - pretty printing, signs of life - wire things up again - Switch SubGraph and CandidateKernel to TVM objects - naive CombineByKindFusionRule, just to see what we're up agaist Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying. - preparing to mimic FuseOps - rework SubGraph to use IndexSet - rough cut at MaximalFusion - split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph. 
- top-down iterative handling of sub-sub-graphs - about to give up on one-pass extraction with 'sub-sub-graphs' - Add notion of 'labels' to sub-graphs - Rework FusionRules to be more compositional - partway through reworking fusion rules, broken - SubGraph::IsValid, but still need to add no_taps check - dataflow rework, preparing for SubGraph::IsValid - explode into subdir - mnist with one fusion rule (which fires twice) working - switch to CandidateKernelIndex - Confirm can measure 'pre-annotated' primitive functions - checkpoint - stuff - more sketching - dominator logging
- VLOG in vm runner
- lints
- Get test_pass_collage_partition.py going
- one more rollback
- fix relay.collage ffi prefix.
- zap all unnecessary changes
- test (and minor cleanup) of CandidatePartition::EstimateCost
- More partition rule tests
- tuple arg test
- Test ByKind CombinerRule
- Move the TOpPattern attributes from Python to C++ so visible to C++ unit tests.
- wibble
- wibble
- starting to add CombinerRule unit tests
- sync with mbs-collage-subgraph changes
- rebase
- sync
- Clarify dataflow_graph.expr() vs expr constraints
- Beef up test_sub_graph
- Polish
- False alarm, reverting unnecessary const fiddles
- Bad merge, still have bug with missing const.
- Fix rebase
- Prepare for rebase
- Move CaptureIndexInSpans to generic tvm.relay.transform
- Fix test_sub_graph.py unit tests
- Make PartitionSpecs 1:1 with Targets
- Fix tests
- Finish merging Matthew's changes
- First pass merging Matthew's changes
- finish fixing lints
- test_tensorrt.py runs
- some lint fixes while waiting
- test annotation fiddles, disable pytorch test
- fix constant handling
- update tests for new API
- Switch TensorRT BYOC integration to IRModule-at-a-time
- [bug] index out of range
- don't need InferTypeExpr
- revert unnecessary changes
- revert unnecessary changes
- fix accumulate bug
- sync with 11481
- Eta-expand tuple args in candidate partitions (so measurement does not need to worry about constructing tuple arguments)
- Polish compiler_function_utils for splitting out
- Mark functions as extern.
- Get rid of relay.ext.cutlass
- kExternalSymbol:String ----> kExtern:Bool
- Host glitch if PlanDevices run before CollagePartition
- Fix unit test
- Make load_static_library first class python func
- Get CUTLASS going on graph executor as well as vm
- Include export_library in estimate_seconds
- Rollback DSOLibrary changes.
- Add StaticLibraryNode and switch CUTLASS to use it. This avoids the crazy serialize/deserialize/load hackery, which I'll now remove.
- Get running again
- CUTLASS picks up all options from 'cutlass' external codegen target.
- Revert false starts with cutlass handling
- Get CUTLASS going with program-at-a-time tuning and compilation instead of function at a time.
- Save DSOLibraries by contents rather than by reference.
- futzing with libraries
- revert unnecessary cutlass changes
- starting unit test for dsolibrary save
- Prepare scalar changes for PR.
- Eager candidate cost measurement.
- More conv2d_cudnn.cuda training records.
- cleanup before rebase
- Use 'regular' target when build, not external codegen target
- Tuned for -libs=cudnn
- Tune before collage not during
- Bring over target changes
- Fix GetSpecName
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Whoops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d.
- OutlineExternalFunction both when preparing to estimate cost and after optimal partitioning applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- actually checkin 'built in' tuning log (covers mnist & gpt2 only)
- trying to debug gpt2
- Update TargetKind attribute name
- working through gpt2 issues
- checkin tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- tune during search (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule iface)
- Add 'host' as first-class partitioning spec (Avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc with new SubGraph::Rewrite approach. Ready for updating RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- respect existing target constraints from device planner
- make 'compiler' and 'fusion_rule' attributes avail on all target kinds
- moved design to tvm-rfcs, apache/tvm-rfcs#62
- incorporate comments
- avoid repeated fusion
- fix trt type checking
- better logs
- pretty print primitive rules
- fix tensorrt
- multiple targets per spec
- don't extract candidate function until need cost
  Need to bring CombineByPrimitives back under control since lost depth limit.
- cleaned up fusion rule names
- added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower', even if a candidate fires may still need to consider leaving node behind for VM (especially for constants).
- starting example
- finished all the dd sections
- documentation checkpoint
- docs checkpoint
- more design
- starting on dd
- runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- need to account for build_cutlass_kernels_vm
- move cutlass tuning into relay.ext.cutlass path to avoid special case
- add utils
- don't fuse non-scalar constants for tvm target.
- stuck on cuda mem failure on conv2d, suspect bug in main
- where do the cutlass attrs come from?
- running, roughly
- pretty printing, signs of life
- wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- naive CombineByKindFusionRule, just to see what we're up against
  Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- preparing to mimic FuseOps
- rework SubGraph to use IndexSet
- rough cut at MaximalFusion
- split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
In rendered form:
https://github.com/mbs-octoml/mbs-tvm-rfcs/blob/mbs-rfcs-collage/rfcs/0062-collage.md