** Collage v2 sketch ***
- Polish compiler_function_utils for splitting out
- Mark functions as extern.
- Get rid of relay.ext.cutlass
- kExternalSymbol:String ----> kExtern:Bool
- Host glitch if PlanDevices run before CollagePartition
- Fix unit test
- Make load_static_library first class python func
- Get CUTLASS going on graph executor as well as vm
- Include export_library in estimate_seconds
- Rollback DSOLibrary changes.
- Add StaticLibraryNode and switch CUTLASS to use it
  This avoids the crazy serialize/deserialize/load hackery, which I'll now remove.
- Get running again
- CUTLASS picks up all options from 'cutlass' external codegen target.
- Revert false starts with cutlass handling
- Get CUTLASS going with program-at-a-time tuning and compilation instead of
  function at a time.
- Save DSOLibraries by contents rather than by reference.
- futzing with libraries
- revert unnecessary cutlass changes
- starting unit test for dsolibrary save
- Prepare scalar changes for PR.
- Eager candidate cost measurement.
- More conv2d_cudnn.cuda training records.
- cleanup before rebase
- Use 'regular' target when build, not external codegen target
- Tuned for -libs=cudnn
- Tune before collage not during
- Bring over target changes
- Fix GetSpecName
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Woops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d.
- OutlineExternalFunction both when preparing to estimate cost and after optimal
  partitioning applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig
  (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only
  (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- actually checkin 'built in' tuning log (covers mnist & gpt2 only)
- trying to debug gpt2
- Update TargetKind attribute name
- working through gpt2 issues
- checkin tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- tune during search
  (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule iface)
- Add 'host' as first-class partitioning spec
  (Avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc with new SubGraph::Rewrite approach
  Ready for updating RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- respect existing target constraints from device planner
- make 'compiler' and 'fusion_rule' attributes avail on all target kinds
- moved design to tvm-rfcs, apache/tvm-rfcs#62
- incorporate comments
- avoid repeated fusion
- fix trt type checking
- better logs
- pretty print primitive rules
- fix tensorrt
- multiple targets per spec
- don't extract candidate function until need cost
  Need to bring CombineByPrimitives back under control since lost depth limit.
- cleaned up fusion rule names
- added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower', even if a candidate fires may still need to consider
  leaving node behind for VM (especially for constants).
- starting example
- finished all the dd sections
- documentation checkpoint
- docs checkpoint
- more design
- starting on dd
- runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- need to account for build_cutlass_kernels_vm
- move cutlass tuning into relay.ext.cutlass path to avoid special case
- add utils
- don't fuse non-scalar constants for tvm target.
- stuck on cuda mem failure on conv2d, suspect bug in main
- where do the cutlass attrs come from?
- running, roughly
- pretty printing, signs of life
- wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- naive CombineByKindFusionRule, just to see what we're up against
  Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- preparing to mimic FuseOps
- rework SubGraph to use IndexSet
- rough cut at MaximalFusion
- split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
mbs-octoml committed Jun 3, 2022
1 parent 9dceb4e commit 97b5e54
Showing 89 changed files with 13,895 additions and 932 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -292,6 +292,7 @@ tvm_file_glob(GLOB_RECURSE RELAY_OP_SRCS
 )
 tvm_file_glob(GLOB_RECURSE RELAY_PASS_SRCS
     src/relay/analysis/*.cc
+    src/relay/collage/*.cc
     src/relay/transforms/*.cc
     src/relay/quantize/*.cc
 )
314 changes: 314 additions & 0 deletions collage_autotvm_rtx3070.tuninglog

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions include/tvm/relay/expr.h
@@ -39,6 +39,16 @@
 #include "./type.h"

 namespace tvm {
+
+/*!
+ * \brief Returns \p global_var with the given properties. A null property denotes 'no change'.
+ * Returns \p global_var if all properties are unchanged. Otherwise, returns a copy with the new
+ * fields.
+ */
+GlobalVar WithFields(GlobalVar global_var, Optional<String> opt_name_hint = {},
+                     Optional<Type> opt_type = {}, Optional<VirtualDevice> opt_virtual_device = {},
+                     Optional<Span> opt_span = {});
+
 namespace relay {

 using Expr = tvm::RelayExpr;
@@ -97,8 +107,17 @@ class Constant : public Expr {
   TVM_DLL explicit Constant(runtime::NDArray data, Span span = Span());

   TVM_DEFINE_OBJECT_REF_METHODS(Constant, RelayExpr, ConstantNode);
+  TVM_DEFINE_OBJECT_REF_COW_METHOD(ConstantNode);
 };

+/*!
+ * \brief Returns \p constant with the given properties. A null property denotes 'no change'.
+ * Returns \p constant if all properties are unchanged. Otherwise, returns a copy with the new
+ * fields.
+ */
+Constant WithFields(Constant constant, Optional<runtime::NDArray> opt_data = {},
+                    Optional<VirtualDevice> opt_virtual_device = {}, Optional<Span> opt_span = {});
+
 /*! \brief Tuple of multiple Exprs */
 class Tuple;
 /*! \brief Tuple container */
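The `WithFields` doc comments above describe a "null means no change" update: an absent optional leaves the field alone, and the original object is returned when nothing changes. The following is a minimal Python sketch of that contract only. `GlobalVarSketch` and `with_fields` are hypothetical names made up for illustration, the sketch models just two of the fields, and it compares by value, which may differ from how the C++ implementation decides that a field is unchanged.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class GlobalVarSketch:
    """Hypothetical stand-in for a GlobalVar; not the TVM class."""

    name_hint: str
    span: Optional[str] = None


def with_fields(
    var: GlobalVarSketch,
    opt_name_hint: Optional[str] = None,
    opt_span: Optional[str] = None,
) -> GlobalVarSketch:
    """Return `var` itself if nothing changes, otherwise an updated copy."""
    # None plays the role of a null Optional: keep the existing value.
    name_hint = var.name_hint if opt_name_hint is None else opt_name_hint
    span = var.span if opt_span is None else opt_span
    if name_hint == var.name_hint and span == var.span:
        return var  # all properties unchanged: hand back the same object
    return replace(var, name_hint=name_hint, span=span)


original = GlobalVarSketch("main")
same = with_fields(original)
renamed = with_fields(original, opt_name_hint="main_partitioned")
```

Returning the same object in the unchanged case lets callers detect "nothing happened" with an identity check, so a rewrite that changes nothing allocates nothing.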
2 changes: 2 additions & 0 deletions include/tvm/relay/expr_functor.h
@@ -240,6 +240,8 @@ class MixedModeVisitor : public ::tvm::relay::ExprVisitor {
    */
   explicit MixedModeVisitor(int visit_limit = 1);

+  using ExprVisitor::VisitExpr_;
+
   /*!
    * \brief VisitExpr is finalized to preserve call expansion of dataflow regions
    */
37 changes: 25 additions & 12 deletions include/tvm/relay/op_attr_types.h
@@ -41,24 +41,37 @@ using tir::BijectiveLayoutNode;
 using tir::Layout;
 using tir::LayoutAxis;

-/*! \brief operator pattern used in graph fusion */
+/*!
+ * \brief Operator pattern used to guide fusion.
+ */
 enum OpPatternKind {
-  // Elementwise operation
+  // Elementwise operator, eg relu.
+  // \code
+  //   out[i, j, k] = op(in[i, j, k])
+  // \endcode
+  // The underlying scalar op can always be moved to the point the input tensor was created.
   kElemWise = 0,
-  // Broadcasting operator, can always map output axis to the input in order.
-  // for example :code:`out[i, ax1, j, ax2] = input[i, j]`.
-  // Note that the axis need to be in order so transpose is not a bcast operator.
+  // Broadcasting operator, eg add.
+  // As for kElemWise, but some output axes may be broadcasted, and the remaining must correspond
+  // to input axes in order.
+  // \code
+  //   out[i, j, k] = op(in[i, j])
+  // \endcode
+  // (So transpose is not a kBroadcast).
   kBroadcast = 1,
-  // Injective operator, can always injectively map output axis to a single input axis.
-  // All injective operator can still be safely fused to injective and reduction.
+  // Injective operator, eg concat.
+  // Can always injectively map output axis to a single input axis.
+  // All kInjecting operators can be fused to kInjective and kCommReduce operators.
+  // Eg: concatenate
   kInjective = 2,
-  // Communicative reduction operator.
+  // Communicative reduction operator, eg sum.
   kCommReduce = 3,
-  // Complex operation, can still fuse elemwise operations into its output.
-  // but cannot chain another complex op
+  // Complex operation, eg conv2d. Often called the fused sub-graph's 'anchor node'.
+  // Can fuse kElemWise operations into its output, but cannot fuse additional kOutEWiseFusable
+  // operations.
   kOutEWiseFusable = 4,
-  // The pattern for tuple nodes. Can fuse into subsequent injective ops,
-  // but treated specially
+  // A tuple.
+  // Can fuse into subsequent injective ops, but treated specially.
   kTuple = 7,
   // Opaque operation, cannot fuse anything.
   kOpaque = 8
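The new `OpPatternKind` comments state which kinds may be fused together. As a rough illustration, the sketch below encodes only the rules spelled out in those comments. The function `can_fuse_into_output` is hypothetical and is not the FuseOps algorithm. It assumes that every kind up to and including `kInjective` counts as injective, and it conservatively answers "no" for cases the comments leave open (tuples, which are "treated specially", and reductions as producers).

```python
from enum import IntEnum


class OpPatternKind(IntEnum):
    """Mirrors the enum values shown in the diff above."""

    ELEM_WISE = 0
    BROADCAST = 1
    INJECTIVE = 2
    COMM_REDUCE = 3
    OUT_EWISE_FUSABLE = 4
    TUPLE = 7
    OPAQUE = 8


def can_fuse_into_output(producer: OpPatternKind, consumer: OpPatternKind) -> bool:
    """Hypothetical check: may `consumer` be fused onto the output of `producer`?"""
    if OpPatternKind.OPAQUE in (producer, consumer):
        return False  # opaque operators cannot fuse with anything
    if producer == OpPatternKind.OUT_EWISE_FUSABLE:
        # An anchor such as conv2d accepts elementwise consumers only,
        # so a second anchor can never be chained onto it.
        return consumer == OpPatternKind.ELEM_WISE
    if producer <= OpPatternKind.INJECTIVE:
        # Assumption: elementwise and broadcast operators are also injective.
        # Injective producers fuse into injective consumers and reductions.
        return consumer <= OpPatternKind.COMM_REDUCE
    return False  # tuples and reductions as producers are not modelled here
```

The numeric ordering of the enum does most of the work: for the first four kinds, a larger value means a less constrained access pattern, so simple range comparisons express the rules.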
5 changes: 5 additions & 0 deletions include/tvm/relay/transform.h
@@ -281,6 +281,11 @@ TVM_DLL Pass InferType();
  */
 TVM_DLL Type InferTypeLocal(const Expr& expr);

+/*!
+ * \brief Infer the types of all sub-expression of expr.
+ */
+TVM_DLL Expr InferTypeExpr(const Expr& expr);
+
 /*!
  * \brief Search and eliminate common subexpression. For example, if there are
  * two expressions evaluated to an identical value, a single variable is created
18 changes: 15 additions & 3 deletions python/tvm/autotvm/task/dispatcher.py
@@ -58,6 +58,10 @@ class DispatchContext(object):
     def __init__(self):
         self._old_ctx = DispatchContext.current

+    # TODO(mbs): Collage only: Allow cache query
+    def contains(self, target, workload):
+        raise NotImplementedError()
+
     def query(self, target, workload):
         """
         Query the context to get the specific config for a template.
@@ -297,8 +301,9 @@ def load(self, records):
         counter = 0
         for inp, res in joint_records:
             counter += 1
-            if res.error_no != 0:
-                continue
+            # TODO(mbs): Collage only: Cache the error so don't re-tune
+            # if res.error_no != 0:
+            # continue

             # use target keys in tvm target system as key to build best map
             for k in inp.target.keys:
@@ -320,7 +325,14 @@ def load(self, records):
                 if np.mean(other_res.costs) > np.mean(res.costs):
                     best_by_model[key] = (inp, res)

-        logger.debug("Finish loading %d records", counter)
+        # TODO(mbs): Collage only: Too verbose
+        # logger.info("Finished loading %d records", counter)
+
+    # TODO(mbs): Collage only: Allow cache query
+    def contains(self, target, workload):
+        #logger.info(
+        # f"look for match with {target} and {workload} with {len(self._best_user_defined)} user-defined, {len(self.best_by_model)} model and {len(self.best_by_targetkey)} target entries")
+        return self._query_inside(target, workload) is not None

     def _query_inside(self, target, workload):
         if target is None:
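Taken together, the dispatcher changes above add a `contains` query and stop skipping failed records on load, so that a workload that already failed to tune is still "known" and is not tuned again. A minimal sketch of that idea follows. The class names, the flat `(target, workload, cost, error_no)` record layout, and the dictionary cache are all assumptions made for illustration; they are not the autotvm data structures.

```python
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, str, float, int]  # (target, workload, cost, error_no)


class DispatchContextSketch:
    """Hypothetical stand-in for a dispatch context; not the autotvm class."""

    def contains(self, target: str, workload: str) -> bool:
        raise NotImplementedError()


class HistoryBestSketch(DispatchContextSketch):
    """Keeps the cheapest record seen for each (target, workload) pair."""

    def __init__(self) -> None:
        self._best: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def load(self, records: Iterable[Record]) -> None:
        for target, workload, cost, error_no in records:
            key = (target, workload)
            # Failed records (error_no != 0) are deliberately NOT skipped, so a
            # later contains() reports the workload as already attempted.
            if key not in self._best or cost < self._best[key][0]:
                self._best[key] = (cost, error_no)

    def _query_inside(self, target: str, workload: str) -> Optional[Tuple[float, int]]:
        return self._best.get((target, workload))

    def contains(self, target: str, workload: str) -> bool:
        # A cache query is just a lookup that does not fall back to a default.
        return self._query_inside(target, workload) is not None


ctx = HistoryBestSketch()
ctx.load([
    ("cuda", "conv2d_a", 0.002, 0),         # successful measurement
    ("cuda", "conv2d_b", float("inf"), 1),  # failed measurement, still cached
])
```

A caller could then check `ctx.contains(target, workload)` before scheduling a tuning task and skip any workload that is already cached, whether it succeeded or failed.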
2 changes: 1 addition & 1 deletion python/tvm/contrib/cutlass/__init__.py
@@ -15,4 +15,4 @@
 # specific language governing permissions and limitations
 # under the License.
 """BYOC support for CUTLASS."""
-from .build import tune_cutlass_kernels, build_cutlass_kernels, build_cutlass_kernels_vm
+from .build import num_cutlass_partitions, finalize_modules, finalize_modules_vm