[LLVM] Codegen subroutine call when CallNode::op is GlobalVar #14901
Conversation
Previously, `CodeGenLLVM` required all TIR PrimFuncs to have the `kGlobalSymbol` attribute, using its value as the externally-visible symbol in the generated library. This commit relaxes that requirement, using the presence of `kGlobalSymbol` to indicate whether a function should be exposed externally. If `kGlobalSymbol` is not defined, then the symbol name is generated from the name of the `tvm::GlobalVar` with the prefix `"_internal_"`, and the symbol is not exposed externally. Since this does not change the codegen behavior for any function that was previously supported, this is not a breaking change.
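To illustrate the distinction, here is a minimal TVMScript sketch (the function names `main` and `subroutine` are hypothetical, and the exact TVMScript spelling may vary between TVM versions): `main` carries the `kGlobalSymbol` attribute (`"global_symbol"`) and keeps its externally-visible symbol, while `subroutine` does not, so it would be emitted under a generated `_internal_subroutine` name and hidden from the library's exports.

```python
import tvm
from tvm.script import ir as I, tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((4,), "float32")):
        # kGlobalSymbol present: exposed externally as "main".
        T.func_attr({"global_symbol": "main"})
        Module.subroutine(A.data)

    @T.prim_func
    def subroutine(A_data: T.handle("float32")):
        # No kGlobalSymbol: symbol name is generated from the GlobalVar,
        # prefixed with "_internal_", and not exposed externally.
        A = T.decl_buffer((4,), "float32", data=A_data)
        A[0] = 0.0
```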
The functionality tested in this commit was added across several recent PRs, each of which tested their features in isolation. This PR adds unit tests to validate the end-to-end behavior of TIR subroutine calls. PRs building up to this point:

- TVMScript: apache#14889, apache#14915, apache#14919, apache#14941
- Functionality improvements of existing TIR passes: apache#14913, apache#14914, apache#14918, apache#14951
- Changes to the TIR lowering flow: apache#14942, apache#14985
- Codegen updates: apache#14958, apache#14901
- Compatibility updates/fixes: apache#14892, apache#14950, apache#14943, apache#14944, apache#14945, apache#14952, apache#14982, apache#14949
…#14901)

* [CodeGen][LLVM] Codegen to generate internal functions

  Previously, `CodeGenLLVM` required all TIR PrimFuncs to have the `kGlobalSymbol` attribute, using its value as the externally-visible symbol in the generated library. This commit relaxes that requirement, using the presence of `kGlobalSymbol` to indicate whether a function should be exposed externally. If `kGlobalSymbol` is not defined, then the symbol name is generated from the name of the `tvm::GlobalVar` with the prefix `"_internal_"`, and the symbol is not exposed externally. Since this does not change the codegen behavior for any function that was previously supported, this is not a breaking change.

* [Codegen][LLVM] Handle callsite for internal functions
* [UnitTest][LLVM] Added test for LLVM codegen for subroutine
Looks like this PR introduced some ROCm backend issues. Code to reproduce:

```python
import tvm
from tvm import te
import numpy as np
import tvm.testing
from tvm.script import tir as T
from tvm.tir import TensorIntrin
M = 64
N = 64
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle):
        T.func_attr({"global_symbol": "main"})
        A = T.match_buffer(a, (M, N), dtype="float32")
        B = T.match_buffer(b, (M, N), dtype="float32")
        for i, j in T.grid(M, N):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0

ir_module = MyModule
sch = tvm.tir.Schedule(ir_module, debug_mask="all")
block_b = sch.get_block("B")
i, j = sch.get_loops(block_b)
sch.bind(i, "blockIdx.x")
sch.bind(j, "threadIdx.x")
print(sch.mod["main"].script())

ctx = tvm.rocm(0)
with tvm.transform.PassContext():
    rocm_mod = tvm.build(sch.mod, target="rocm")
```

```
Traceback (most recent call last):
File "../../tvm_rocm/memory_copy.py", line 38, in <module>
rocm_mod = tvm.build(sch.mod, target="rocm -mcpu=gfx90a")
File "/home/aiscuser/v-leiwang3/tvm/python/tvm/driver/build_module.py", line 281, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
File "/home/aiscuser/v-leiwang3/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 238, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
7: TVMFuncCall
6: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}>(tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
5: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
4: tvm::codegen::Build(tvm::IRModule, tvm::Target)
3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::IRModule, tvm::Target)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::IRModule, tvm::Target)>(tvm::runtime::Module (*)(tvm::IRModule, tvm::Target), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
2: tvm::codegen::BuildAMDGPU(tvm::IRModule, tvm::Target)
1: tvm::codegen::CodeGenLLVM::Finish()
0: tvm::codegen::CodeGenLLVM::Verify() const
File "/home/aiscuser/v-leiwang3/tvm/src/target/llvm/codegen_llvm.cc", line 361
TVMError: LLVM module verification failed with the following errors:
Calling convention requires void return type
ptr @main_kernel
Function return type does not match operand type of return inst!
ret void
i32
```

This can be resolved through:

```cpp
// src/target/llvm/codegen_amdgpu.cc
class CodeGenAMDGPU : public CodeGenLLVM {
 public:
  CodeGenAMDGPU() = default;
  virtual ~CodeGenAMDGPU() = default;

  llvm::Function* DeclareFunction(const GlobalVar& gvar, const PrimFunc& f) {
    // AMD GPU kernel functions should have a void return type.
    return this->DeclareFunctionInternal(gvar, f, true);
  }
};
```

This works because AMD GPU kernels should have a void return type, but by default the base LLVM codegen implementation is called, which returns `int`. Previously there were no issues, until this code snippet was added:

```cpp
// src/target/llvm/codegen_llvm.cc
if (auto it = functions_.find(gvar.get()); it != functions_.end()) {
  return it->second;
}
```

And if we want to leverage shared memory, there's another issue. I still have no idea how to resolve it; please help me take a look @Lunderberg
We should not support external globals in the device codegen.
Following up on this. Device-side codegen follows a different convention, and likely we cannot simply have a call from the host to the device via an external call. Instead, we need to do them through packed function calls (via the PackedFunc convention). So I wonder if we should revert some of the changes in the device codegen (nvptx, rocm) and only support subroutine rewriting on the host side. cc @Lunderberg
Agreed, the device-side codegen should not handle host-to-device calls. If any such calls exist (e.g. resulting from SplitHostDevice), they are lowered to packed function calls. I'm investigating the issue at the moment, and can reproduce it locally. (Thank you @LeiWang1999 for the test case!) It looks like it is a discrepancy between the declared and generated return types.
@LeiWang1999 It looks like the discrepancy is specifically between the return type of the PrimFunc and the `"ret_void"` argument supplied to the LLVM codegen. Long-term, rather than providing the `"ret_void"` argument explicitly, the return type should be taken from the `PrimFunc::ret_type` annotation.
Prior to this commit, the `"ret_void"` argument needed to be explicitly provided to `CodeGenLLVM::AddFunction` and `CodeGenLLVM::DeclareFunction`. If this was inconsistent with the `builtin::ret()` usage within the `PrimFunc`, it could cause an incorrect return type in the generated LLVM IR, resulting in LLVM IR verification failures. This commit removes the `"ret_void"` argument, instead using the type annotation in `PrimFunc::ret_type`, removing this opportunity for inconsistency. This PR is intended to fix a ROCm regression reported in apache#14901 (comment).
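As a sketch of what this changes (hypothetical function name, assuming a TVM build that includes the fix): the return type used by codegen now follows the PrimFunc's own annotation, so a `builtin::ret()` in the body is checked against `PrimFunc::ret_type` rather than against a separately supplied flag.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def returns_int() -> T.int32:
    # T.ret lowers to builtin::ret(); its value must agree with the
    # annotated return type, which codegen now consults directly.
    T.ret(42)

print(returns_int.ret_type)  # the annotation codegen reads, here int32
```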
@tqchen Forgot to respond to this part. That is the intent: the LLVM subroutines are supported only on the host side. The primary changes to nvptx/rocm were to make sure they could provide the arguments required by the updated `DeclareFunction`/`AddFunction` interface.
@LeiWang1999 Can you try running your test case with PR #15127? This is a draft PR that removes the `"ret_void"` argument.
Thanks @Lunderberg, I will take a look later. To reproduce the shared memory external global issue, the test case can be accessed at https://gist.github.com/LeiWang1999/d0db6e802f358a2a42fca32b5cd42441
@LeiWang1999 Thank you for the test case. I can reproduce the shmem error, though it looks like it predates the LLVM changes in this PR. Unfortunately, that means there isn't a clear before/after that could be compared and fixed. Do you know if shared memory on rocm has worked in the past? FYI, I also condensed it down to the following test case, which reproduces the same error:

```python
import tvm
from tvm.script import tir as T
@T.prim_func
def func():
    A = T.alloc_buffer(1, scope="shared")
    for i in T.thread_binding(1, thread="threadIdx.x"):
        A[0] = 0.0

tvm.build(func, target="rocm")
```
@Lunderberg At least as of commit 1d98634, the rocm shared memory should work.
I think the `ret_void` argument issue has been fixed through the related PRs, thanks @Lunderberg! But from my tests, the external global function issue still remains.
I am facing the same issue on the ROCm platform. Is there a solution yet?
@zhangxiao-stack Check out this branch: https://github.com/LeiWang1999/tvm/blob/lei/feat-hip/tests/python/unittest/test_tir_schedule_tensorize_hip_mfma.py#L112 This is an alternative HIP source code generation path, and it might be one possible interim solution on ROCm.
Analogous to #14901, treat GlobalVar callees as internal function calls in CodeGenC. This specific PR doesn't provide new end-to-end functionality, as the target="c" backend isn't compiled. It does lead into allowing subroutines in any target whose codegen derives from CodeGenC, which will depend on the single-module lowering flow in #14985.

* [CodeGenC] Added unit tests for desired behavior
* [CodeGenC] Handle GlobalVar callee as internal function call
* Update CodeGenC subclasses for updated interface
  - Call `DeclareFunction` for each `PrimFunc`, prior to any `AddFunction` calls
  - Provide both `GlobalVar` and `PrimFunc` to `AddFunction` calls
* Updated CRT test to expect forward declaration
* Provide forward declarations for call_extern in cmsis
* Avoid duplicate forward declaration

  C's automatic pointer cast (e.g. `void*` to `int*`) means that use of the arguments to infer the function signature may be incorrect. If a `call_extern` refers to a function within the same module, only output a single forward declaration based on the PrimFunc's parameters, not based on the CallNode's arguments.

* Updated expected ptx cuda
* Cast the AOT pools to the arg type
* Improved tvm::GetType for tvm_access_ptr and address_of

  These `Call` instances can return a `PointerType(PrimType(pointee_dtype))` rather than a `PrimType(DataType::Handle())`.

* [ARM][Topi] Update micro kernels to use same argument type as caller

  Previously, the micro kernels for gemm, avg_pool, max_pool, and tensordot relied on C's implicit type conversions for the arguments, when the caller's argument types differ from the signature's parameter types. This works, except when the codegen has auto-generated a forward declaration based on the caller's argument types, such as during AOT, which then causes a conflicting definition. Since the codegen cannot determine the function names from the `"pragma_import_c"` in order to suppress these forward declarations, this conflict can be more easily resolved by updating the micro kernel signatures. The three types of mismatches are below.

  - Use of `int` or `long` parameters, whose width may vary by compiler, instead of fixed-width types.
  - TIR expecting the data array's integer type to also be used as an error code's return type, rather than the micro kernels' `int32_t` error code.
  - Pointer conversion done during argument conversion.

  Type conversions are done at the start of each micro kernel, to avoid changing types that are used within the computational sections of each micro kernel.

* Updated unit tests with private=True

  Required for internal functions after PR #15214

* Docstring updates from review
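For context, a minimal sketch of observing the CodeGenC behavior described above (assuming the `Module` sketch from earlier on this page and a TVM build that includes these changes): building with `target="c"` and inspecting the emitted source should show a forward declaration for the internal subroutine ahead of its definition.

```python
# Build the two-function module with the C backend and inspect the
# emitted source, which should contain a forward declaration of the
# internal subroutine before the body of "main".
lib = tvm.build(Module, target="c")
print(lib.get_source())
```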
Previously, the `CallNode::op` had to be a known built-in operation. This commit allows LLVM codegen to produce a subroutine call to another function within the same IRModule, with `CallNode::op` specifying the `GlobalVar` that represents that function.
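Concretely, the node shape handled here can be sketched as follows (a standalone illustration; the name "subroutine" is arbitrary):

```python
import tvm
from tvm import tir

# A tir.Call whose op is a GlobalVar, i.e. a reference to another
# function in the same IRModule, rather than a built-in Op.
gvar = tvm.ir.GlobalVar("subroutine")
call = tir.Call("int32", gvar, [tir.const(1, "int32")])
assert isinstance(call.op, tvm.ir.GlobalVar)
```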