
[TIR] add loop partition hint pragma #9121

Merged 4 commits into apache:main from support_loop_partition_hint on Sep 29, 2021

Conversation

wrongtest-intellif (Contributor) commented on Sep 26, 2021

Currently the LoopPartition pass only tries to partition loops associated with conditions carrying the likely tag. It would be great if developers could take control of which loops to partition, no matter whether the condition to eliminate is tagged likely or not.

This PR adds a pragma attr key loop_partition_hint, which can be tagged explicitly in the schedule phase. The loop partition pass will then consider all arithmetic conditions on the hinted loop var.
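
For context, here is a minimal sketch of the rewrite that LoopPartition performs (plain Python with made-up names and bounds, not actual TIR); the hint simply forces this split for the annotated loop even when the condition carries no likely tag:

    # A self-contained illustration of the transformation; A, B and the
    # bounds are invented for the example.
    A = list(range(70))
    B = [0] * 70

    # Before partitioning: every iteration evaluates the boundary condition.
    for i in range(70):
        if i < 64:  # condition on the (hinted) loop var
            B[i] = A[i] + 1

    # After partitioning: the pass proves the condition per region and drops it.
    for i in range(64):  # main loop: the condition is always true, no branch
        B[i] = A[i] + 1
    # Tail region [64, 70): the condition is always false, so the body is gone.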

Below are two examples of how explicitly controlled loop partitioning helps. The target is an i7-7700 running Ubuntu 20.04, with LLVM version 11.0:

  • For max pooling with padding inlined, which creates conditional buffer accesses:

    import numpy as np
    import tvm
    from tvm import te, topi, testing

    data = te.placeholder([1, 128, 56, 56], name="x")
    out = topi.nn.pool2d(data, kernel=[5, 5], stride=[1, 1], padding=[2, 2, 2, 2], pool_type="max", dilation=[1, 1], layout="NCHW")
    pad = out.op.input_tensors[0]
    x = tvm.nd.array(np.random.randint(0, 64, [1, 128, 56, 56]).astype("float32"))
    
    def test(do_partition):
        s = te.create_schedule([out.op])
        s[pad].compute_inline()
        n, c, h, w = s[out].op.axis
        if do_partition:
            s[out].pragma(h, "loop_partition_hint")
            s[out].pragma(w, "loop_partition_hint")
    
        with tvm.ir.transform.PassContext(config={"tir.LoopPartition": {"partition_const_loop": True}}):
            f = tvm.build(s, [data, out], "llvm")
        y = tvm.nd.array(np.zeros([1, 128, 56, 56]).astype("float32"))
        f(x, y)
        result = y.asnumpy()
        print(f.get_source("asm"))
        evaluator = f.time_evaluator(f.entry_name, tvm.cpu(), number=1000)
        print("partition=%s: %.3f millisecs" % (do_partition, evaluator(x, y).mean * 1000))
        return result
    
    r1 = test(do_partition=False)
    r2 = test(do_partition=True)
    testing.assert_allclose(r1, r2, rtol=1e-5)

    The performance I get:

    • no loop partition: 3.708 millisecs
    • with loop partition: 0.975 millisecs
  • For a tiled matmul following the TVM tensor expression tutorial, but with a shape not divisible by the tiling factor. The TIR split does not create a likely condition for this case, so the hint is attached via a small helper (see also the note after the examples):

    import numpy as np
    import tvm
    from tvm import te

    M, N, K = 1025, 1025, 1025
    dtype = "float32"
    dev = tvm.cpu()
    a = tvm.nd.array(np.random.rand(M, K).astype(dtype), dev)
    b = tvm.nd.array(np.random.rand(K, N).astype(dtype), dev)
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C")
    f = te.create_prim_func([A, B, C])
    s = tvm.tir.Schedule(f)
    
    def evaluate_operation(s, target, optimization):
        with tvm.ir.transform.PassContext(config={"tir.LoopPartition": {"partition_const_loop": True}}):
            print(tvm.lower(s.mod["main"], [], simple_mode=True))
            func = tvm.build(s.mod["main"], [], target=target, name="mmult")
            assert func

        c = tvm.nd.array(np.zeros((M, N), dtype=dtype), dev)
        func(a, b, c)
        evaluator = func.time_evaluator(func.entry_name, dev, number=10)
        mean_time = evaluator(a, b, c).mean
        print("%s: %f" % (optimization, mean_time))
    
    # no opt
    evaluate_operation(s, target="llvm", optimization="none")
    
    # tiling and vectorize
    x, y, k = s.get_loops(s.get_block("C"))
    xo, xi = s.split(x, factors=[None, 32])
    yo, yi = s.split(y, factors=[None, 32])
    ko, ki = s.split(k, factors=[None, 4])
    s.reorder(xo, yo, ko, ki, xi, yi)
    s.vectorize(yi)
    evaluate_operation(s, target="llvm", optimization="blocking")
    
    # loop partition
    # Manually attach a loop annotation by rebuilding the For node and
    # replacing it in the schedule state (see the note after the examples).
    def pragma(s, rv, key):
        sref = s.get_sref(rv)
        loop = sref.stmt
        new_loop = tvm.tir.For(loop.loop_var, loop.min, loop.extent, loop.kind, loop.body, annotations={key: 1})
        s.state.replace(sref, new_loop)
    pragma(s, xo, "pragma_loop_partition_hint")
    pragma(s, yo, "pragma_loop_partition_hint")
    evaluate_operation(s, target="llvm", optimization="loop_partition")

    The performance I get (in seconds):

    • no opt: 1.374402
    • with tiling + vectorize: 0.843930
    • with tiling + vectorize + loop partition: 0.272183
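
Note: at the time of this PR, tir.Schedule apparently has no primitive for attaching loop annotations, hence the manual pragma helper above. On newer TVM versions that provide tir.Schedule.annotate (an assumption about the installed version), the same hint can likely be attached directly:

    # Hypothetical equivalent on a later TVM with Schedule.annotate:
    s.annotate(xo, "pragma_loop_partition_hint", True)
    s.annotate(yo, "pragma_loop_partition_hint", True)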

junrushao (Member) commented:

CC @areusch @Hzfengsy @vinx13 @zxybazh @ZihengJiang, would you guys review this PR? It looks like it's relevant to some of our previous discussions.

Review threads on include/tvm/tir/stmt.h and tests/python/unittest/test_tir_transform_loop_partition.py were resolved.
wrongtest-intellif (Contributor, Author) commented:

@Hzfengsy hi~ I resolved the comment issues; can you kindly take another round of review? :)

Hzfengsy merged commit 198a8ab into apache:main on Sep 29, 2021
Hzfengsy (Member) commented:

Thanks, @wrongtest

wrongtest-intellif deleted the support_loop_partition_hint branch on September 29, 2021 at 09:45.
AndrewZhaoLuo added a commit to AndrewZhaoLuo/tvm that referenced this pull request Sep 29, 2021
* main:
  Fix flaky NMS test by making sure scores are unique (apache#9140)
  [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038)
  [LLVM] Make changes needed for opaque pointers (apache#9138)
  Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849)
  [CI] Split Integration tests out of first phase of pipeline (apache#9128)
  [Meta Schedule][M3b] Runner (apache#9111)
  Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141)
  [TIR] add loop partition hint pragma (apache#9121)
  fix things (apache#9146)
  [Meta Schedule][M3a] SearchStrategy (apache#9132)
  [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133)
  [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143)
  [OpenCL] Remove redundant visit statement in CodeGen. (apache#9144)
  [BYOC] support arbitrary input dims for add/mul/relu of dnnl c_src codegen (apache#9127)
  [Relay][ConvertLayout] Support for qnn.conv2d_transpose (apache#9139)
  add nn.global_avgpool to fq2i (apache#9137)
  [UnitTests] Enable minimum testing on Vulkan target in CI (apache#9093)
  [Torch] Support returning quantized weights and bias for BYOC use cases (apache#9135)
  [Relay] Prepare for new plan_devices.cc (part II) (apache#9130)
  [microTVM][Zephyr] Add MIMXRT1050 board support (apache#9068)
AndrewZhaoLuo added a commit to AndrewZhaoLuo/tvm that referenced this pull request Sep 30, 2021
* main: (80 commits)
  Introduce centralised name transformation functions (apache#9088)
  [OpenCL] Add vectorization to cuda conv2d_nhwc schedule (apache#8636)
  [6/6] Arm(R) Ethos(TM)-U NPU codegen integration with `tvmc` (apache#8854)
  [microTVM] Add wrapper for creating project using a MLF (apache#9090)
  Fix typo (apache#9156)
  [Hotfix][Testing] Wait for RPCServer to be established (apache#9150)
  Update find cublas so it search default path if needed. (apache#9149)
  [TIR][LowerMatchBuffer] Fix lowering strides when source region has higher dimension than the buffer (apache#9145)
  Fix flaky NMS test by making sure scores are unique (apache#9140)
  [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038)
  [LLVM] Make changes needed for opaque pointers (apache#9138)
  Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849)
  [CI] Split Integration tests out of first phase of pipeline (apache#9128)
  [Meta Schedule][M3b] Runner (apache#9111)
  Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141)
  [TIR] add loop partition hint pragma (apache#9121)
  fix things (apache#9146)
  [Meta Schedule][M3a] SearchStrategy (apache#9132)
  [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133)
  [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143)
  ...
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* add loop partition hint pragma

* fix unintialized var

* fix to remove hint at last

* use tir compare for loop partition testcase
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022 (same four commits as above)