
Enhancing Random Kernel Launch with Updated CUDA Graph Tools and a Modular CUDA Graph Layer #58310

Conversation

@eee4017 (Contributor) commented Oct 23, 2023

PR types

New features

PR changes

APIs

Description

Launching CUDA kernels with the CUDAGraphNodeLauncher

To better handle CUDA kernels that consume random seeds inside a CUDA Graph, we introduce the CUDAGraphNodeLauncher. It is a redesign of the older PD_RECORD_CUDA_GRAPH_RANDOM_KERNEL mechanism and uses cuGraphExecKernelNodeSetParams to update the random seed of each captured kernel node before the graph is replayed.
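As a rough illustration of the mechanism, the sketch below patches the seed argument of one kernel node on an already instantiated graph and then replays it (a minimal sketch, not the PR's code; relaunch_with_new_seed is a hypothetical helper, and the two-argument parameter list, including the id convention explained in the next paragraph, is an assumption):

    #include <cuda.h>

    // Minimal sketch: refresh one kernel node's arguments on an
    // instantiated graph, then replay it. Error checking is omitted.
    void relaunch_with_new_seed(CUgraphExec exec, CUgraphNode node,
                                CUstream stream, unsigned int id,
                                unsigned long long seed) {
      CUDA_KERNEL_NODE_PARAMS params;
      cuGraphKernelNodeGetParams(node, &params);  // captured launch config
      void* args[] = {&id, &seed};  // must mirror the kernel's parameters
      params.kernelParams = args;
      // Update the executable graph in place; no re-instantiation needed.
      cuGraphExecKernelNodeSetParams(exec, node, &params);
      cuGraphLaunch(exec, stream);
    }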

One requirement of this approach is that the first parameter of any kernel launched through the launcher must be an unsigned int id. The id ties each kernel launch to its node in the captured CUDA graph: because every launch receives a unique id, each kernel can be matched unambiguously to the graph node it produced.
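A minimal sketch of that calling convention (the kernel and its body are hypothetical, not the PR's code):

    #include <cuda_runtime.h>

    // Hypothetical kernel following the launcher's convention: the first
    // parameter is the unsigned int id that locates this launch's node in
    // the captured graph; the seed is what gets refreshed on each replay.
    __global__ void random_scale_kernel(unsigned int id,
                                        unsigned long long seed,
                                        float* data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        data[i] *= static_cast<float>((seed ^ i) % 7);  // toy use of seed
      }
    }

    // At capture time, each launch receives its own unique id:
    // random_scale_kernel<<<grid, block>>>(id, seed, data, n);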

Previously, the PD_RECORD_CUDA_GRAPH_RANDOM_KERNEL method identified kernel nodes by comparing their launch parameters bitwise, as can be observed here. This comparison has two primary shortcomings:

  • Inconsistent Parameter Capture: A node's dynamic parameters are not always captured correctly. A case in point is this instance, where a templated callable functor serves as a parameter; in that scenario the bitwise comparison fails (see the sketch after this list).

  • Lack of Distinction for Identical Launches: The bitwise comparison cannot distinguish between kernel launches that are issued multiple times with identical parameters.
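A minimal sketch of these pitfalls (illustrative only; KernelArgs and same_launch are hypothetical names, not PaddlePaddle code):

    #include <cstring>

    // Hypothetical argument pack: a seed plus a templated callable
    // captured by value, as in the functor case cited above.
    template <typename Functor>
    struct KernelArgs {
      unsigned long long seed;
      Functor op;
    };

    template <typename F>
    bool same_launch(const KernelArgs<F>& a, const KernelArgs<F>& b) {
      // Bitwise equality is unreliable: struct padding and functor state
      // can differ between semantically identical launches, while two
      // distinct launches can still share the exact same byte pattern.
      return std::memcmp(&a, &b, sizeof(KernelArgs<F>)) == 0;
    }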

Modular CUDA Graph Layer: CUDAGraphedLayer

To speed up computations that are replayed many times, we add the CUDAGraphedLayer. It converts a standard PaddlePaddle model into one executed through CUDA Graphs: CUDAGraphedLayer wraps an existing PaddlePaddle model (an nn.Layer), so models gain the launch-overhead savings of CUDA Graphs without any extra work (see the sketch below for the underlying pattern).
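Under the hood this follows the standard CUDA Graph capture-and-replay pattern. A minimal sketch of that pattern in plain CUDA runtime code (a generic illustration, not PaddlePaddle's implementation; run_layer_kernels is a hypothetical stand-in for the wrapped layer's forward pass):

    #include <cuda_runtime.h>

    void run_layer_kernels(cudaStream_t stream);  // hypothetical forward pass

    void graphed_forward(cudaStream_t stream) {
      static cudaGraphExec_t exec = nullptr;
      if (exec == nullptr) {
        // First call: record the forward pass into a graph once.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_layer_kernels(stream);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // CUDA 11-style signature
        cudaGraphDestroy(graph);
      }
      // Later calls replay the entire forward pass with a single launch.
      cudaGraphLaunch(exec, stream);
    }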

@paddle-bot bot commented Oct 23, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot bot added the contributor (External developers) label Oct 23, 2023
};

using cudaGraphExecuterSetter_t = std::function<void(cudaGraphExec_t)>;

@eee4017 (Contributor, Author) commented:

Kindly review the documentation for CUDAGraphNodeLauncher.



class CUDAGraphedLayer(paddle.nn.Layer):
"""
@eee4017 (Contributor, Author) commented:

Kindly review the documentation for CUDAGraphedLayer.

// kernel<<<>>>(id, ...); // Launching the kernel with id
// };
//
// [Retrieving CUDA Function]
@eee4017 (Contributor, Author) commented:

The API is intended to be invoked from within each library, such as a user-defined operator. Consequently, users of CUDAGraphNodeLauncher are responsible for obtaining the cudaFunction_t (CUfunction) handle themselves; this API should not be called directly from the CUDAGraphNodeLauncher class.
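For example, the defining library can resolve the handle for its own kernel with cudaGetFuncBySymbol (mentioned in this PR's commits); a minimal sketch, with hypothetical names:

    #include <cuda_runtime.h>

    __global__ void my_kernel(unsigned int id, float* data) {}  // hypothetical

    // Resolve the host-side kernel symbol to a driver-level function
    // handle from inside the shared library that defines the kernel.
    cudaFunction_t get_my_kernel_function() {
      cudaFunction_t func;
      cudaGetFuncBySymbol(&func, reinterpret_cast<const void*>(my_kernel));
      return func;
    }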

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@eee4017 (Contributor, Author) commented:

This unit test verifies the functionality of CUDAGraphedLayer as well as the random kernel mechanism.

for (auto node : nodes) {
  CUgraphNode cuNode = node;
  CUgraphNodeType pType;
  PADDLE_ENFORCE_GPU_SUCCESS(dynload::cuGraphNodeGetType(cuNode, &pType));
@eee4017 (Contributor, Author) commented Oct 23, 2023:

The Importance of the CUDA Driver API in CUDAGraphNodeLauncher

Our revised implementation underscores the importance of the CUDA driver API. The original code relied on cudaGraphKernelNodeGetParams from the CUDA runtime API to retrieve node parameters, which occasionally resulted in the cudaErrorInvalidDeviceFunction error. Kindly refer to the code here.

This error arises when attempting to retrieve a node whose kernel lives in another shared library, such as cuDNN kernels or user-defined kernels. In the CUDA driver API, a shared library is represented by a CUlibrary handle, details of which can be found here. The CUDA runtime API, by contrast, hides this structure from the user and largely presumes that kernel function pointers originate from the same library; that assumption produces the error above when kernel function pointers come from other libraries. Working with the CUDA driver API directly avoids this issue.
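To make the contrast concrete (illustrative only, not PaddlePaddle's code; set_node_function is a hypothetical helper): the runtime call must map the node's function pointer back to a symbol in the calling module, whereas the driver path takes an explicit CUfunction handle supplied by the kernel's own library.

    #include <cuda.h>

    // Runtime path (for contrast): cudaGraphKernelNodeGetParams(node, &p)
    // may fail with cudaErrorInvalidDeviceFunction when the node's kernel
    // was defined in a different shared library.

    // Driver path: the defining library resolves its own CUfunction, so
    // the graph machinery performs no cross-library symbol lookup.
    void set_node_function(CUgraphExec exec, CUgraphNode node,
                           CUfunction func, CUDA_KERNEL_NODE_PARAMS params) {
      params.func = func;  // handle obtained by the kernel's own library
      cuGraphExecKernelNodeSetParams(exec, node, &params);
    }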

It's crucial to note that, with this change, users of the CUDAGraphNodeLauncher are responsible for obtaining the CUfunction handle themselves, since the kernel may reside in any of several libraries.

XieYunshen previously approved these changes Oct 24, 2023
@XieYunshen (Contributor) left a comment:

LGTM

@onecatcn (Contributor) commented:

2023-10-24 20:07:48 0. You must have raindrops2sea or XiaoguangHu01 approval for change 20+ files or add than 1000+ lines of content.
2023-10-24 20:07:48 1. Unittest is not allowed to be disabled.
2023-10-24 20:07:48 You must have one RD (kolinwei(Recommend), wanghuancoder, luotao1, QingshuChen, qili93 or ZzSean or Aurelius84) approval for the usage of @unittest.skip or @unittest.skipIf.

@XieYunshen (Contributor) left a comment:

LGTM

@wanghuancoder (Contributor) left a comment:

LGTM for @unittest.skipIf

@XiaoguangHu01 (Contributor) left a comment:

LGTM

@@ -0,0 +1,146 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
A reviewer (Contributor) commented:

2021 -> 2023

@zyfncg zyfncg merged commit c633e52 into PaddlePaddle:develop Oct 30, 2023
28 checks passed
zeroRains pushed a commit to zeroRains/Paddle that referenced this pull request Nov 8, 2023
…dular CUDA Graph Layer (PaddlePaddle#58310)

* Proposal to fix CUDA Graph Random Kernel Issue

* fix template linking

* fix test_cuda_graph_partial_graph_static_run

* rewrite CUDAGraphNodeLauncher using lambda CallBack

* use cuda driver API and use cudaGetFuncBySymbol

* use cuda dyload driver; document

* add cuda_graphed_layer module

* add cuda_graphed_layer module

* add UT; add Doc; pre-commit

* pre-commit

* remove obsolete code; add cuda version check

* add dummy cudaGetFuncBySymbol

* add dummy cudaGetFuncBySymbol

* add dummy cudaGetFuncBySymbol

* cmake test rules

* cmake format

* Check CUDA Version test_standalone_cuda_graph_multi_stream

* cmake format

* test_standalone_cuda_graph_multi_stream

* use skipif instread of cmake

* test_cuda_graph_partial_graph_static_run

* rm stream_safe_cuda_alloc_test
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
…dular CUDA Graph Layer (PaddlePaddle#58310)

Labels: contributor (External developers), NVIDIA
7 participants