[TIR] SplitHostDevice, handle subroutines #14918

Lunderberg · 2023-05-22T18:03:05Z

This PR refactors SplitHostDevice into three separate transformations. Previously, SplitHostDevice extracted each device region and replaced it with a builtin::tvm_call_packed() node. After this PR, the process is performed in three separate steps.

- AnnotateDeviceRegions: Annotate the regions that should be executed on another target.
- SplitHostDevice: Extract the annotated region into an independent PrimFunc, with a GlobalVar to represent the call into the new subroutine.
- LowerDeviceKernelLaunch: For any subroutine call where the caller and callee are on different devices, replace the call with a device kernel launch.

Prior to this commit, `TargetNode::host` could be specified in TVMScript as part of the config dictionary, under the key `"host"`. However, this required all other device parameters to be explicitly specified, rather than using any of the shorthand string representations. This commit forwards the `host` argument from TVMScript's `T.target` method to `tvm.target.Target`, allowing both the device and host to be specified using the shorthand string representation.

```python
from tvm.script import tir as T


@T.prim_func
def before_this_commit():
    T.func_attr(
        {
            "target": T.target(
                {
                    "arch": "sm_86",
                    "host": {"keys": ["cpu"], "kind": "llvm", "tag": ""},
                    "keys": ["cuda", "gpu"],
                    "kind": "cuda",
                    "max_num_threads": 1024,
                    "tag": "",
                    "thread_warp_size": 32,
                }
            )
        }
    )
    T.evaluate(0)


@T.prim_func
def after_this_commit():
    T.func_attr({"target": T.target("cuda", host="llvm")})
    T.evaluate(0)
```

Previously, the symbol name of the extracted compute kernel was defined based on the `kGlobalSymbol` attribute, which was required to be present. This commit updates `SplitHostDevice` to generate the symbol name using `kGlobalSymbol` if present, and to fall back to the name of the `tvm::GlobalVar` for internal functions.
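The fallback described above can be sketched in plain Python. This is only an illustration of the lookup order, not the C++ implementation: `kernel_base_name` is a hypothetical helper, the attribute dictionary stands in for a PrimFunc's attrs, and any suffix the real pass appends to the kernel name is left out.

```python
from typing import Mapping, Optional


def kernel_base_name(attrs: Mapping[str, str], global_var_name: str) -> str:
    """Return the name from which an extracted kernel's symbol is derived.

    Hypothetical helper: `attrs` models the function attributes, and
    "global_symbol" is the string key behind `kGlobalSymbol`.
    """
    # Externally exposed functions carry "global_symbol"; prefer it if present.
    symbol: Optional[str] = attrs.get("global_symbol")
    if symbol is not None:
        return symbol
    # Internal functions have no "global_symbol"; use the GlobalVar's name.
    return global_var_name


if __name__ == "__main__":
    print(kernel_base_name({"global_symbol": "main"}, "main_gv"))
    print(kernel_base_name({}, "subroutine"))
```

The point of the fallback is that internal subroutines, which deliberately lack a `kGlobalSymbol`, can still have device regions extracted from them.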

First pass, `AnnotateDeviceRegions`. This pass decides which portions of a PrimFunc should be run on the device, and annotates them with the `kTarget` attribute, indicating which target should be used for later lowering steps.

Second pass, `SplitHostDevice`. This pass extracts the annotated region into an independent PrimFunc. The `kTarget` attribute of the extracted kernel is defined by the `kTarget` annotation inserted by `AnnotateDeviceRegions`. The host function is marked by the `tvm::tir::attr::kIsHostFunc` attribute, allowing it to be recognized by later host-only lowering passes.

Third pass, `LowerDeviceKernelLaunch`. This pass identifies subroutine calls that call into device kernels, and rewrites them into `T.tvm_call_packed`.
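As a rough illustration of how the three passes hand off to one another, the sketch below models them in plain Python over a toy IR. It does not use TVM: `Stmt`, `Func`, the `needs_device` flag, and the `<name>_kernel<N>` naming are all hypothetical stand-ins, and a host function here simply carries the host target directly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Stmt:
    text: str
    needs_device: bool = False    # toy stand-in for "must run on the device"
    target: Optional[str] = None  # filled in by annotate_device_regions
    callee: Optional[str] = None  # set when the statement is a subroutine call


@dataclass
class Func:
    target: str
    body: List[Stmt] = field(default_factory=list)


Module = Dict[str, Func]


def annotate_device_regions(mod: Module, device: str) -> Module:
    """Step 1: annotate statements that should run on another target."""
    for func in mod.values():
        for stmt in func.body:
            if stmt.needs_device:
                stmt.target = device
    return mod


def split_host_device(mod: Module) -> Module:
    """Step 2: move each annotated statement into its own function."""
    out: Module = {}
    for name, func in mod.items():
        host_body: List[Stmt] = []
        count = 0
        for stmt in func.body:
            if stmt.target is None:
                host_body.append(stmt)
                continue
            kernel = f"{name}_kernel{count}"  # hypothetical naming scheme
            count += 1
            # The extracted function takes its target from the annotation.
            out[kernel] = Func(stmt.target, [Stmt(stmt.text)])
            # The caller keeps an ordinary subroutine call to the new function.
            host_body.append(Stmt(f"call {kernel}", callee=kernel))
        out[name] = Func(func.target, host_body)
    return out


def lower_device_kernel_launch(mod: Module) -> Module:
    """Step 3: rewrite calls whose caller and callee targets differ."""
    for func in mod.values():
        for stmt in func.body:
            if stmt.callee is not None and mod[stmt.callee].target != func.target:
                stmt.text = f"tvm_call_packed({stmt.callee!r})"
    return mod


def lower(mod: Module, device: str) -> Module:
    annotated = annotate_device_regions(mod, device)
    return lower_device_kernel_launch(split_host_device(annotated))


if __name__ == "__main__":
    example = {"main": Func("llvm", [Stmt("alloc"), Stmt("compute", needs_device=True)])}
    for name, func in lower(example, "cuda").items():
        print(name, func.target, [s.text for s in func.body])
```

Keeping step 2 free of any launch logic is what lets a subroutine call between two functions on the same target survive step 3 unchanged.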

tvm-bot · 2023-05-22T18:03:08Z

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

cc @Hzfengsy, @junrushao, @quic-sanirudh, @shingjan (See #10317 for details.)

Generated by tvm-bot

Lunderberg · 2023-05-22T18:04:22Z

This PR is a subset of the functionality in #14862. As that PR has grown to a difficult-to-review size, I'm carving off independent chunks of it into separate PRs, after which #14862 can be rebased to a more manageable size.

In addition, this PR depends on #14915 for the new unit tests, and should be rebased onto main after #14915 lands.

Previously, the SplitHostDevice pass added the `tir::attr::kKernelLaunchParams` attribute, and the LowerDeviceKernelLaunch pass filled in the values for it. This cleanup makes the kernel launch params be the sole responsibility of LowerDeviceKernelLaunch.

PRs apache#14913 and apache#14914 made analogous changes to `MakePackedAPI` and `MakeUnpackedAPI` to handle subroutine calls. Both PRs introduced the same symbol, `tvm::tir::SubroutineCallRewriter`, a local utility to update internal calls to a modified function. While each PR passed CI individually, and was therefore able to merge, having both changes caused a duplicate symbol. This commit updates `MakePackedAPI` and `MakeUnpackedAPI` to place their local utilities into anonymous namespaces, avoiding the conflict.

csullivan · 2023-05-25T16:40:40Z

src/tir/transforms/lower_device_kernel_launch.cc

+      bool is_host_func =
+          func->GetAttr<Bool>(tvm::tir::attr::kIsHostFunc).value_or(Bool(false))->value;


Re. discussion offline, do we want to update to not rely on kIsHostFunc?

Agreed, and updated. SplitHostDevice and LowerDeviceKernelLaunch no longer use the kIsHostFunc attribute at all.

Update to use the `tvm::tir::IsHostFunc` utility function, rather than the `kIsHostFunc` attribute. Per discussion on apache#14020, the `kIsHostFunc` attribute should only be used in `BindTarget`, and should not be re-introduced in `SplitHostDevice`.

Lunderberg · 2023-05-25T17:41:18Z

(Merged #14950 into this PR to allow CI to run. No conflicts expected after CI, so github's squash/merge will remove it from the final commit.)

csullivan

LGTM, thank you for the refactoring! Looking forward to the additional functionality that splitting SplitHostDevice into separate steps will enable during lowering.

tqchen · 2023-05-27T14:28:31Z

There seems to be a regression caused by, or related to, this PR. Please look into it.

https://ci.tlcpack.ai/blue/organizations/jenkins/tvm-gpu/detail/main/715/pipeline

Lunderberg · 2023-05-30T13:06:31Z

@tqchen Thank you for the heads-up. I'm looking into it.

This resolves an issue introduced by the combination of apache#14918 and apache#14945. The bug occurred for targets that do not require device-side codegen, but do require a `device_type` other than `kDLCPU`. It wasn't caught by CI, as the issue only occurred with the combination of both PRs.

1. apache#14918 updated `SplitHostDevice` to only modify the `"target"` attribute when a device-side function has been extracted.
2. For VTA, there is no device-side function, as everything is done through host-side API calls.
3. From (1) and (2), the VTA examples kept the target `T.target("ext_dev", host="llvm")` after the `SplitHostDevice` pass, instead of being updated to `T.target("llvm")`.
4. apache#14945 restricted `CombineContextCall` to only apply to host-side functions.
5. From (3) and (4), the `CombineContextCall` pass was no longer applied to the VTA context calls.

This PR fixes `SplitHostDevice`, updating the target from `T.target("ext_dev", host="llvm")` to `T.target("llvm")`, even if no device sections have been extracted from the function.
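The chain of reasoning above can be replayed with a small plain-Python model. Everything here is an assumption made for illustration: `Target`, `host_side_target`, and `runs_host_only_pass` are hypothetical, the `with_fix` flag switches between the behavior before and after the bugfix, and host functions are recognized by a `"cpu"` key as a simplification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Target:
    kind: str
    keys: Tuple[str, ...] = ()
    host: Optional["Target"] = None


LLVM = Target("llvm", keys=("cpu",))
VTA = Target("ext_dev", host=LLVM)  # toy model of T.target("ext_dev", host="llvm")


def host_side_target(target: Target, extracted_device_func: bool, with_fix: bool) -> Target:
    """Target left on the host function after the toy SplitHostDevice."""
    if target.host is None:
        return target
    # Before the fix, the target was only rewritten when a kernel was extracted.
    if extracted_device_func or with_fix:
        return target.host
    return target


def runs_host_only_pass(target: Target) -> bool:
    """Toy check for whether a host-only pass would apply to this function."""
    return "cpu" in target.keys


if __name__ == "__main__":
    for with_fix in (False, True):
        after = host_side_target(VTA, extracted_device_func=False, with_fix=with_fix)
        print(with_fix, after.kind, runs_host_only_pass(after))
```

In this model the unfixed path leaves the function with the `ext_dev` target, so the host-only pass skips it, which mirrors steps 3 and 5 of the list.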

Lunderberg · 2023-05-30T14:22:34Z

Found the issue. It was a combination of this PR and #14945, which is why it wasn't caught by CI. A full description of the issue, and a bugfix, are in PR #14982.

The functionality tested in this commit was added across several recent PRs, each of which tested their features in isolation. This PR adds unit tests to validate the end-to-end behavior of TIR subroutine calls.

PRs building up to this point:

- TVMScript
  - apache#14889
  - apache#14915
  - apache#14919
  - apache#14941
- Functionality improvements of existing TIR passes
  - apache#14913
  - apache#14914
  - apache#14918
  - apache#14951
- Changes to the TIR lowering flow
  - apache#14942
  - apache#14985
- Codegen updates
  - apache#14958
  - apache#14901
- Compatibility updates/fixes
  - apache#14892
  - apache#14950
  - apache#14943
  - apache#14944
  - apache#14945
  - apache#14952
  - apache#14982
  - apache#14949

* PR#14915 [TVMScript] Allow T.target("device", host="host") in TVMScript
* [Target] Added WithoutHost method
* [TIR] SplitHostDevice, handle missing kGlobalSymbol
* [TIR] Refactor SplitHostDevice into three separate passes
* Add unit tests specifically for SplitHostDevice behavior
* Added unit test specifically for AnnotateDeviceRegions
* Added unit tests for LowerDeviceKernelLaunch
* Minor cleanup, moved all kernel launch collection into one spot
* Updated unit tests for LowerWarpMemory
* Updated unit tests for ThreadSync
* Updated unit test for inject ptx async copy
* [Bugfix] Avoid symbol conflicts in MakePackedAPI/MakeUnpackedAPI
* Maintain "tir.is_global_func" attr in device-side entry point
* SplitHostDevice, update the host-side target to be the host
* [TIR] Update LowerDeviceKernelLaunch to avoid kIsHostFunc
* Remove is_host_func from SplitHostDevice tests

* [Bugfix][TIR][VTA] Update host-side target, even without device func

* [CodegenC] Updated unit test for sorted CodegenC output

  Previously, this unit test generated a `Map<tvm::Target, IRModule>` whose default iteration order was not sorted by function name, built the `Map` of modules, then validated whether the resulting C code was a sorted list of 4 elements. However, this condition was stricter than necessary, as it depended on the number of items added to the `Map` until it was unsorted. This commit updates the test to instead validate that `std::is_sorted` returns true.

* Ignore __tvm_main__ in unit test
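The relaxed test condition can be sketched in Python as a sortedness check that does not care how many names are present. This is an analogue of the `std::is_sorted` check, not the test itself: `check_function_order` is a hypothetical helper, and filtering out `__tvm_main__` before the comparison is an assumption about how that name is ignored.

```python
from typing import Iterable, List, Tuple


def is_sorted(names: Iterable[str]) -> bool:
    """Python analogue of std::is_sorted: every element <= its successor."""
    items: List[str] = list(names)
    return all(a <= b for a, b in zip(items, items[1:]))


def check_function_order(
    names: Iterable[str], ignored: Tuple[str, ...] = ("__tvm_main__",)
) -> bool:
    """Validate ordering of generated function names, skipping ignored ones."""
    return is_sorted(name for name in names if name not in ignored)


if __name__ == "__main__":
    print(check_function_order(["add", "mul", "__tvm_main__", "sub"]))
    print(check_function_order(["sub", "add"]))
```

Checking the ordering property directly avoids tying the test to an exact element count.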

* [Bugfix][TIR][VTA] Update host-side target, even without device func

* [TIR] Restrict tir.transform.LowerTVMBuiltin to host functions

  Previously, the `tir.transform.LowerTVMBuiltin` pass applied to all functions in an `IRModule`, but was only applied to modules that contain only host functions. This commit updates `tir.transform.LowerTVMBuiltin` to apply only to host functions.

* Updated "stackvm" target to have "cpu" key.

  With the presence/absence of the "cpu" key in a target used to determine whether host-only calls should be run, should make sure to add it to "stackvm".

* Update IsHostFunc() to use "host" tag instead of "cpu"

  Current CI failures due to LowerTVMBuiltin not running on "hexagon" target, and would like to avoid conflating cpu/host.

* Avoid "host" tag for now

* Update HEXAGON_AOT_LLVM_TARGET to be recognized as host
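A minimal sketch of restricting a rewrite to host functions, assuming (as the commit notes describe) that a `"cpu"` key in the target marks a host function. The module layout, `is_host_func`, and `apply_to_host_funcs` are hypothetical and written in plain Python rather than against the TVM API.

```python
from typing import Callable, Dict, Tuple

# Toy function record: (target keys, body text).
ToyFunc = Tuple[Tuple[str, ...], str]


def is_host_func(keys: Tuple[str, ...]) -> bool:
    """Treat a function as host-side when its target has the "cpu" key."""
    return "cpu" in keys


def apply_to_host_funcs(
    mod: Dict[str, ToyFunc], rewrite: Callable[[str], str]
) -> Dict[str, ToyFunc]:
    """Apply `rewrite` to host functions only; device functions pass through."""
    return {
        name: (keys, rewrite(body) if is_host_func(keys) else body)
        for name, (keys, body) in mod.items()
    }


if __name__ == "__main__":
    module = {
        "main": (("cpu",), "tvm_struct_get"),
        "main_kernel": (("cuda", "gpu"), "tvm_struct_get"),
    }
    print(apply_to_host_funcs(module, str.upper))
```

With this shape, a module may mix host and device functions, and a target lacking the `"cpu"` key (the "stackvm" case noted above) would silently be skipped.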


tqchen · 2023-12-14T04:01:26Z

A note that we should revisit some of the assumptions #16237

Lunderberg added 7 commits May 22, 2023 09:55

[Target] Added WithoutHost method

dfda99a

Add unit tests specifically for SplitHostDevice behavior

f973841

Added unit test specifically for AnnotateDeviceRegions

95ae88e

Added unit tests for LowerDeviceKernelLaunch

f3c87a5

Lunderberg force-pushed the split_host_device_handle_subroutines branch from 7dabdd0 to a856076 Compare May 22, 2023 19:44

Lunderberg added 4 commits May 23, 2023 08:25

Updated unit tests for LowerWarpMemory

455c7a8

Updated unit tests for ThreadSync

fcdfaf2

Updated unit test for inject ptx async copy

36b5657

csullivan reviewed May 25, 2023

Lunderberg added 5 commits May 25, 2023 12:06

Maintain "tir.is_global_func" attr in device-side entry point

96dc763

SplitHostDevice, update the host-side target to be the host

2c7c605

Remove is_host_func from SplitHostDevice tests

30d2821

Merge branch 'hotfix_symbol_conflict_pr_14950' into split_host_device…

7cba585

…_handle_subroutines_pr_14918

csullivan approved these changes May 26, 2023

csullivan merged commit 6eb0779 into apache:main May 26, 2023

Lunderberg mentioned this pull request May 30, 2023

[Bugfix][TIR][VTA] Update host-side target, even without device func #14982

Merged

Lunderberg deleted the split_host_device_handle_subroutines branch May 30, 2023 14:22

Lunderberg mentioned this pull request May 31, 2023

[Hexagon] Remove "c" as aot_host_target tvm/contrib/hexagon/pytest_pl… #14997

Merged

ysh329 mentioned this pull request Jul 12, 2023

[Release] v0.13.0 Release Candidate Notes #15295

Closed

tqchen mentioned this pull request Dec 14, 2023

[Bug] SplitHostDevice generates free var when var only exists in T.thread_binding of device function #16237

Closed
