forked from apache/tvm
Commit
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- Rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- Actually check in 'built in' tuning log (covers MNIST & GPT2 only)
- Trying to debug GPT2
- Update TargetKind attribute name
- Working through GPT2 issues
- Check in tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- Tune during search (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule interface)
- Add 'host' as first-class partitioning spec (avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc. with new SubGraph::Rewrite approach. Ready for updating the RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite *** CAUTION: Almost certainly broken ***
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- Respect existing target constraints from device planner
- Make 'compiler' and 'fusion_rule' attributes available on all target kinds
- Moved design to tvm-rfcs, apache/tvm-rfcs#62
- Incorporate comments
- Avoid repeated fusion
- Fix TRT type checking
- Better logs
- Pretty print primitive rules
- Fix TensorRT
- Multiple targets per spec
- Don't extract candidate function until cost is needed. Need to bring CombineByPrimitives back under control since the depth limit was lost.
- Cleaned up fusion rule names
- Added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower': even if a candidate fires, may still need to consider leaving the node behind for the VM (especially for constants)
- Starting example
- Finished all the dd sections
- Documentation checkpoint
- Docs checkpoint
- More design
- Starting on dd
- Runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- Need to account for build_cutlass_kernels_vm
- Move cutlass tuning into relay.ext.cutlass path to avoid special case
- Add utils
- Don't fuse non-scalar constants for TVM target
- Stuck on CUDA memory failure on conv2d; suspect bug in main
- Where do the cutlass attrs come from?
- Running, roughly
- Pretty printing, signs of life
- Wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- Naive CombineByKindFusionRule, just to see what we're up against. Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- Preparing to mimic FuseOps
- Rework SubGraph to use IndexSet
- Rough cut at MaximalFusion
- Split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph
- Top-down iterative handling of sub-sub-graphs
- About to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- Partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- Dataflow rework, preparing for SubGraph::IsValid
- Explode into subdir
- MNIST with one fusion rule (which fires twice) working
- Switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- Checkpoint
- Stuff
- More sketching
- Dominator logging
1 parent 68beae9 · commit d03f187
Showing 98 changed files with 9,962 additions and 886 deletions.
@@ -0,0 +1,15 @@
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 1, 32, 32], "float32"], ["TENSOR", [8, 1, 5, 5], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 968, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 2, 4, 1]], ["tile_x", "sp", [-1, 1, 7, 7]], ["tile_rc", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 1]]}, "result": [[1000000000.0], 6, 10, 1648166365.035291], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 1, 32, 32], "float32"], ["TENSOR", [8, 1, 5, 5], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 748547, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 4, 1]], ["tile_y", "sp", [-1, 1, 1, 4]], ["tile_x", "sp", [-1, 1, 14, 1]], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 5]], ["tile_rx", "sp", [-1, 5]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[2.1807114592422733e-06, 2.182203281316585e-06, 2.183491385782991e-06], 0, 1.8035461902618408, 1648233194.5253587], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 8, 18, 18], "float32"], ["TENSOR", [16, 8, 5, 5], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 7905, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 4, 4]], ["tile_x", "sp", [-1, 1, 49, 1]], ["tile_rc", "sp", [-1, 4]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[1.4285206158127155e-05, 1.4285846107313532e-05, 1.4331592281168714e-05], 0, 7.421089172363281, 1648237434.129], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 8, 18, 18], "float32"], ["TENSOR", [16, 8, 5, 5], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 714012, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 8, 1]], ["tile_y", "sp", [-1, 1, 1, 1]], ["tile_x", "sp", [-1, 1, 7, 2]], ["tile_rc", "sp", [-1, 8]], ["tile_ry", "sp", [-1, 5]], ["tile_rx", "sp", [-1, 5]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}, "result": [[2.5586838960333487e-06, 2.5701070606157226e-06, 2.572374535019662e-06], 0, 3.1794843673706055, 1648239614.7956486], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_small_batch.gpu", [["TENSOR", [1, 256], "float32"], ["TENSOR", [10, 256], "float32"], null, "float32"], {}], "config": {"index": 4, "code_hash": null, "entity": [["tile_k", "sp", [-1, 16]]]}, "result": [[2.158152404676017e-06, 2.1645748896629425e-06, 2.1784918293729133e-06], 0, 1.6369056701660156, 1648241555.184448], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_large_batch.gpu", [["TENSOR", [1600, 768], "float32"], ["TENSOR", [2304, 768], "float32"], null, "float32"], {}], "config": {"index": 61851361, "code_hash": null, "entity": [["tile_x", "sp", [-1, 2, 2, 8]], ["tile_y", "sp", [-1, 1, 2, 9]], ["tile_k", "sp", [-1, 2, 4]]]}, "result": [[0.004074227972972973, 0.0040861373243243244, 0.004086151648648648], 0, 3.037601947784424, 1648251189.6885986], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_small_batch.gpu", [["TENSOR", [1600, 768], "float32"], ["TENSOR", [2304, 768], "float32"], null, "float32"], {}], "config": {"index": 5, "code_hash": null, "entity": [["tile_k", "sp", [-1, 8]]]}, "result": [[0.0268318398, 0.026832641350000002, 0.02683273135], 0, 4.179340600967407, 1648254281.8060668], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "batch_matmul.cuda", [["TENSOR", [600, 32, 64], "float32"], ["TENSOR", [600, 32, 64], "float32"], [600, 32, 32], "float32", 0, 1], {}], "config": {"index": 20386, "code_hash": null, "entity": [["tile_y", "sp", [-1, 2, 8]], ["tile_x", "sp", [-1, 16, 1]], ["tile_k", "sp", [-1, 16]], ["auto_unroll_max_step", "ot", 32], ["unroll_explicit", "ot", 1]]}, "result": [[3.258110773592547e-05, 3.258372944511948e-05, 3.261549426218442e-05], 0, 2.397996664047241, 1648255266.3718677], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "batch_matmul.cuda", [["TENSOR", [600, 32, 32], "float32"], ["TENSOR", [600, 64, 32], "float32"], [600, 32, 64], "float32", 0, 1], {}], "config": {"index": 5980, "code_hash": null, "entity": [["tile_y", "sp", [-1, 2, 8]], ["tile_x", "sp", [-1, 16, 1]], ["tile_k", "sp", [-1, 16]], ["auto_unroll_max_step", "ot", 16], ["unroll_explicit", "ot", 0]]}, "result": [[3.199404780823732e-05, 3.199749384187525e-05, 3.200219666269368e-05], 0, 2.3573713302612305, 1648257050.9987426], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_large_batch.gpu", [["TENSOR", [1600, 768], "float32"], ["TENSOR", [768, 768], "float32"], null, "float32"], {}], "config": {"index": 13482935, "code_hash": null, "entity": [["tile_x", "sp", [-1, 5, 16, 1]], ["tile_y", "sp", [-1, 4, 16, 2]], ["tile_k", "sp", [-1, 12, 2]]]}, "result": [[0.00026185516898148144, 0.00026186912731481486, 0.0002643642638888889], 0, 5.9183220863342285, 1648262140.4419408], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_small_batch.gpu", [["TENSOR", [1600, 768], "float32"], ["TENSOR", [768, 768], "float32"], null, "float32"], {}], "config": {"index": 9, "code_hash": null, "entity": [["tile_k", "sp", [-1, 32]]]}, "result": [[0.0022258066376811595, 0.0022258676666666666, 0.0022260689855072464], 0, 1.6845574378967285, 1648264221.272429], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_large_batch.gpu", [["TENSOR", [1600, 768], "float32"], ["TENSOR", [3072, 768], "float32"], null, "float32"], {}], "config": {"index": 75386735, "code_hash": null, "entity": [["tile_x", "sp", [-1, 5, 16, 1]], ["tile_y", "sp", [-1, 2, 16, 4]], ["tile_k", "sp", [-1, 2, 12]]]}, "result": [[0.0009476383928571428, 0.0009476764880952381, 0.0009480008333333333], 0, 3.346571207046509, 1648271350.9854434], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_small_batch.gpu", [["TENSOR", [1600, 768], "float32"], ["TENSOR", [3072, 768], "float32"], null, "float32"], {}], "config": {"index": 17, "code_hash": null, "entity": [["tile_k", "sp", [-1, 768]]]}, "result": [[1000000000.0], 4, 4.362995386123657, 1648274146.1389868], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_large_batch.gpu", [["TENSOR", [1600, 3072], "float32"], ["TENSOR", [768, 3072], "float32"], null, "float32"], {}], "config": {"index": 15171048, "code_hash": null, "entity": [["tile_x", "sp", [-1, 5, 4, 20]], ["tile_y", "sp", [-1, 1, 192, 2]], ["tile_k", "sp", [-1, 8, 2]]]}, "result": [[1000000000.0], 1, 1.2985179424285889, 1648274382.1135368], "version": 0.2, "tvm_version": "0.9.dev0"} | ||
{"input": ["cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32", "dense_small_batch.gpu", [["TENSOR", [1600, 3072], "float32"], ["TENSOR", [768, 3072], "float32"], null, "float32"], {}], "config": {"index": 9, "code_hash": null, "entity": [["tile_k", "sp", [-1, 32]]]}, "result": [[1000000000.0], 4, 4.3437583446502686, 1648274480.7225487], "version": 0.2, "tvm_version": "0.9.dev0"} |