
Enhancing Random Kernel Launch with Updated CUDA Graph Tools and a Modular CUDA Graph Layer #58310

Conversation

@eee4017 (Contributor) commented Oct 23, 2023

PR types

New features

PR changes

APIs

Description

Launching CUDA kernels with the CUDAGraphNodeLauncher

To better handle CUDA kernels that consume random seeds inside a CUDA Graph, we introduce the CUDAGraphNodeLauncher. It is a redesign of the older PD_RECORD_CUDA_GRAPH_RANDOM_KERNEL mechanism and uses cuGraphExecKernelNodeSetParams to update the random seed of each captured kernel node before the graph is replayed.
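As a rough illustration of the mechanism, the sketch below patches the seed argument of one kernel node on an already instantiated graph and then replays it (a minimal sketch, not the PR's code; relaunch_with_new_seed is a hypothetical helper, and the two-argument parameter list, including the id convention explained in the next paragraph, is an assumption):

    #include <cuda.h>

    // Minimal sketch: refresh one kernel node's arguments on an
    // instantiated graph, then replay it. Error checking is omitted.
    void relaunch_with_new_seed(CUgraphExec exec, CUgraphNode node,
                                CUstream stream, unsigned int id,
                                unsigned long long seed) {
      CUDA_KERNEL_NODE_PARAMS params;
      cuGraphKernelNodeGetParams(node, &params);  // captured launch config
      void* args[] = {&id, &seed};  // must mirror the kernel's parameters
      params.kernelParams = args;
      // Update the executable graph in place; no re-instantiation needed.
      cuGraphExecKernelNodeSetParams(exec, node, &params);
      cuGraphLaunch(exec, stream);
    }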

One requirement of this approach is that the first parameter of any kernel launched through the launcher must be an unsigned int id. The id ties each kernel launch to its node in the captured CUDA graph: because every launch receives a unique id, each kernel can be matched unambiguously to the graph node it produced.
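A minimal sketch of that calling convention (the kernel and its body are hypothetical, not the PR's code):

    #include <cuda_runtime.h>

    // Hypothetical kernel following the launcher's convention: the first
    // parameter is the unsigned int id that locates this launch's node in
    // the captured graph; the seed is what gets refreshed on each replay.
    __global__ void random_scale_kernel(unsigned int id,
                                        unsigned long long seed,
                                        float* data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        data[i] *= static_cast<float>((seed ^ i) % 7);  // toy use of seed
      }
    }

    // At capture time, each launch receives its own unique id:
    // random_scale_kernel<<<grid, block>>>(id, seed, data, n);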

Previously, the PD_RECORD_CUDA_GRAPH_RANDOM_KERNEL method identified kernel nodes by comparing their launch parameters bitwise, as can be observed here. This comparison has two primary shortcomings:

  • Inconsistent Parameter Capture: A node's dynamic parameters are not always captured correctly. A case in point is this instance, where a templated callable functor serves as a parameter; in that scenario the bitwise comparison fails (see the sketch after this list).

  • Lack of Distinction for Identical Launches: The bitwise comparison cannot distinguish between kernel launches that are issued multiple times with identical parameters.
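A minimal sketch of these pitfalls (illustrative only; KernelArgs and same_launch are hypothetical names, not PaddlePaddle code):

    #include <cstring>

    // Hypothetical argument pack: a seed plus a templated callable
    // captured by value, as in the functor case cited above.
    template <typename Functor>
    struct KernelArgs {
      unsigned long long seed;
      Functor op;
    };

    template <typename F>
    bool same_launch(const KernelArgs<F>& a, const KernelArgs<F>& b) {
      // Bitwise equality is unreliable: struct padding and functor state
      // can differ between semantically identical launches, while two
      // distinct launches can still share the exact same byte pattern.
      return std::memcmp(&a, &b, sizeof(KernelArgs<F>)) == 0;
    }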

Modular CUDA Graph Layer: CUDAGraphedLayer

To speed up computations that are replayed many times, we add the CUDAGraphedLayer. It converts a standard PaddlePaddle model into one executed through CUDA Graphs: CUDAGraphedLayer wraps an existing PaddlePaddle model (an nn.Layer), so models gain the launch-overhead savings of CUDA Graphs without any extra work (see the sketch below for the underlying pattern).
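Under the hood this follows the standard CUDA Graph capture-and-replay pattern. A minimal sketch of that pattern in plain CUDA runtime code (a generic illustration, not PaddlePaddle's implementation; run_layer_kernels is a hypothetical stand-in for the wrapped layer's forward pass):

    #include <cuda_runtime.h>

    void run_layer_kernels(cudaStream_t stream);  // hypothetical forward pass

    void graphed_forward(cudaStream_t stream) {
      static cudaGraphExec_t exec = nullptr;
      if (exec == nullptr) {
        // First call: record the forward pass into a graph once.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_layer_kernels(stream);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // CUDA 11-style signature
        cudaGraphDestroy(graph);
      }
      // Later calls replay the entire forward pass with a single launch.
      cudaGraphLaunch(exec, stream);
    }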

@paddle-bot bot commented Oct 23, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot bot added the contributor (External developers) label Oct 23, 2023
};

using cudaGraphExecuterSetter_t = std::function<void(cudaGraphExec_t)>;

@eee4017 (Contributor, Author) commented:

Kindly review the documentation for CUDAGraphNodeLauncher.



class CUDAGraphedLayer(paddle.nn.Layer):
"""
@eee4017 (Contributor, Author) commented:

Kindly review the documentation for CUDAGraphedLayer.

// kernel<<<>>>(id, ...); // Launching the kernel with id
// };
//
// [Retrieving CUDA Function]
@eee4017 (Contributor, Author) commented:

The API is intended to be invoked from within each library, such as a user-defined operator. Consequently, users of CUDAGraphNodeLauncher are responsible for obtaining the cudaFunction_t (CUfunction) handle themselves; this API should not be called directly from the CUDAGraphNodeLauncher class.
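For example, the defining library can resolve the handle for its own kernel with cudaGetFuncBySymbol (mentioned in this PR's commits); a minimal sketch, with hypothetical names:

    #include <cuda_runtime.h>

    __global__ void my_kernel(unsigned int id, float* data) {}  // hypothetical

    // Resolve the host-side kernel symbol to a driver-level function
    // handle from inside the shared library that defines the kernel.
    cudaFunction_t get_my_kernel_function() {
      cudaFunction_t func;
      cudaGetFuncBySymbol(&func, reinterpret_cast<const void*>(my_kernel));
      return func;
    }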

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@eee4017 (Contributor, Author) commented:

This unit test verifies the functionality of CUDAGraphedLayer as well as the random kernel mechanism.

for (auto node : nodes) {
  CUgraphNode cuNode = node;
  CUgraphNodeType pType;
  PADDLE_ENFORCE_GPU_SUCCESS(dynload::cuGraphNodeGetType(cuNode, &pType));
@eee4017 (Contributor, Author) commented Oct 23, 2023:

The Importance of the CUDA Driver API in CUDAGraphNodeLauncher

Our revised implementation underscores the importance of the CUDA driver API. The original code relied on cudaGraphKernelNodeGetParams from the CUDA runtime API to retrieve node parameters, which occasionally resulted in the cudaErrorInvalidDeviceFunction error. Kindly refer to the code here.

This error arises when attempting to retrieve a node whose kernel lives in another shared library, such as cuDNN kernels or user-defined kernels. In the CUDA driver API, a shared library is represented by a CUlibrary handle, details of which can be found here. The CUDA runtime API, by contrast, hides this structure from the user and largely presumes that kernel function pointers originate from the same library; that assumption produces the error above when kernel function pointers come from other libraries. Working with the CUDA driver API directly avoids this issue.
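To make the contrast concrete (illustrative only, not PaddlePaddle's code; set_node_function is a hypothetical helper): the runtime call must map the node's function pointer back to a symbol in the calling module, whereas the driver path takes an explicit CUfunction handle supplied by the kernel's own library.

    #include <cuda.h>

    // Runtime path (for contrast): cudaGraphKernelNodeGetParams(node, &p)
    // may fail with cudaErrorInvalidDeviceFunction when the node's kernel
    // was defined in a different shared library.

    // Driver path: the defining library resolves its own CUfunction, so
    // the graph machinery performs no cross-library symbol lookup.
    void set_node_function(CUgraphExec exec, CUgraphNode node,
                           CUfunction func, CUDA_KERNEL_NODE_PARAMS params) {
      params.func = func;  // handle obtained by the kernel's own library
      cuGraphExecKernelNodeSetParams(exec, node, &params);
    }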

It's crucial to note that, with this change, users of the CUDAGraphNodeLauncher are responsible for obtaining the CUfunction handle themselves, since the kernel may reside in any of several libraries.

XieYunshen previously approved these changes Oct 24, 2023
@XieYunshen (Contributor) left a comment:

LGTM

@onecatcn (Contributor) commented:

2023-10-24 20:07:48 0. You must have raindrops2sea or XiaoguangHu01 approval for change 20+ files or add than 1000+ lines of content.
2023-10-24 20:07:48 1. Unittest is not allowed to be disabled.
2023-10-24 20:07:48 You must have one RD (kolinwei(Recommend), wanghuancoder, luotao1, QingshuChen, qili93 or ZzSean or Aurelius84) approval for the usage of @unittest.skip or @unittest.skipIf.

@XieYunshen (Contributor) left a comment:

LGTM

@wanghuancoder (Contributor) left a comment:

LGTM for @unittest.skipIf

@XiaoguangHu01 (Contributor) left a comment:

LGTM

@@ -0,0 +1,146 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
A reviewer (Contributor) commented:

2021 -> 2023

@zyfncg zyfncg merged commit c633e52 into PaddlePaddle:develop Oct 30, 2023
28 checks passed
zeroRains pushed a commit to zeroRains/Paddle that referenced this pull request Nov 8, 2023
…dular CUDA Graph Layer (PaddlePaddle#58310)

* Proposal to fix CUDA Graph Random Kernel Issue

* fix template linking

* fix test_cuda_graph_partial_graph_static_run

* rewrite CUDAGraphNodeLauncher using lambda CallBack

* use cuda driver API and use cudaGetFuncBySymbol

* use cuda dyload driver; document

* add cuda_graphed_layer module

* add cuda_graphed_layer module

* add UT; add Doc; pre-commit

* pre-commit

* remove obsolete code; add cuda version check

* add dummy cudaGetFuncBySymbol

* add dummy cudaGetFuncBySymbol

* add dummy cudaGetFuncBySymbol

* cmake test rules

* cmake format

* Check CUDA Version test_standalone_cuda_graph_multi_stream

* cmake format

* test_standalone_cuda_graph_multi_stream

* use skipif instread of cmake

* test_cuda_graph_partial_graph_static_run

* rm stream_safe_cuda_alloc_test
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
…dular CUDA Graph Layer (PaddlePaddle#58310)

Labels: contributor (External developers), NVIDIA
7 participants