Enhancing Random Kernel Launch with Updated CUDA Graph Tools and a Modular CUDA Graph Layer #58310
Conversation
Your PR has been submitted successfully. Thank you for your contribution to this open source project!
};

using cudaGraphExecuterSetter_t = std::function<void(cudaGraphExec_t)>;
Kindly review the documentation for CUDAGraphNodeLauncher.
class CUDAGraphedLayer(paddle.nn.Layer):
    """
Kindly review the documentation for CUDAGraphedLayer.
// kernel<<<>>>(id, ...); // Launching the kernel with id
// };
//
// [Retrieving CUDA Function]
The API is intended to be invoked within each library, such as user-defined operators. Consequently, users of CUDAGraphNodeLauncher are responsible for obtaining the cudaFunction_t (CUFunction) structure; this API should not be called directly from within the CUDAGraphNodeLauncher class itself.
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This unit test verifies the functionality of CUDAGraphedLayer and also the random kernel mechanism.
for (auto node : nodes) {
  CUgraphNode cuNode = node;
  CUgraphNodeType pType;
  PADDLE_ENFORCE_GPU_SUCCESS(dynload::cuGraphNodeGetType(cuNode, &pType));
The Importance of the CUDA Driver API in CUDAGraphNodeLauncher

Our revised implementation underscores the importance of the CUDA driver API. In the original code, we relied on cudaGraphKernelNodeGetParams from the CUDA runtime API to retrieve node parameters. This approach, however, occasionally resulted in a cudaErrorInvalidDeviceFunction error. Kindly refer to the code here.

This error arises when attempting to retrieve a node that belongs to another shared library, such as cuDNN kernels or user-defined kernels. In the CUDA driver API, a shared library is represented by a CUlibrary handle, details of which can be found here. By contrast, the CUDA runtime API hides this structure from the user and largely presumes that kernel function pointers originate from the same library. That assumption leads to the aforementioned error when accessing kernel function pointers from distinct libraries. Working directly with the CUDA driver API avoids this issue.

It's crucial to note that, with this change, users of the CUDAGraphNodeLauncher are responsible for obtaining the CUFunction structure themselves, particularly since the kernel could reside in a different library.
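As a purely illustrative sketch (not the PR's actual code), a library that owns a kernel could obtain its cudaFunction_t handle with the runtime helper cudaGetFuncBySymbol; the kernel and function names below are placeholders, and the helper requires a recent CUDA toolkit (the PR's commits add a version check and a dummy fallback for older ones).

#include <cuda_runtime.h>

// Placeholder kernel; in practice this lives in the library (for example a
// user-defined operator) that registers the launch with CUDAGraphNodeLauncher.
__global__ void my_random_kernel(unsigned int id, unsigned long long seed,
                                 float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = static_cast<float>(((seed + i) ^ id) & 0xFF);
}

// Must be compiled into the same library as the kernel, so the returned handle
// refers to a function inside that library's own module.
cudaFunction_t GetMyRandomKernelFunc() {
  cudaFunction_t func = nullptr;
  cudaError_t status = cudaGetFuncBySymbol(
      &func, reinterpret_cast<const void *>(my_random_kernel));
  // Real code should check `status`; omitted here for brevity.
  (void)status;
  return func;
}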
LGTM
2023-10-24 20:07:48 0. You must have raindrops2sea or XiaoguangHu01 approval for changing 20+ files or adding 1000+ lines of content.
LGTM
LGTM for @unittest.skipIf
LGTM
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
2021 -> 2023
Enhancing Random Kernel Launch with Updated CUDA Graph Tools and a Modular CUDA Graph Layer (PaddlePaddle#58310)

* Proposal to fix CUDA Graph Random Kernel Issue
* fix template linking
* fix test_cuda_graph_partial_graph_static_run
* rewrite CUDAGraphNodeLauncher using lambda CallBack
* use cuda driver API and use cudaGetFuncBySymbol
* use cuda dyload driver; document
* add cuda_graphed_layer module
* add cuda_graphed_layer module
* add UT; add Doc; pre-commit
* pre-commit
* remove obsolete code; add cuda version check
* add dummy cudaGetFuncBySymbol
* add dummy cudaGetFuncBySymbol
* add dummy cudaGetFuncBySymbol
* cmake test rules
* cmake format
* Check CUDA Version test_standalone_cuda_graph_multi_stream
* cmake format
* test_standalone_cuda_graph_multi_stream
* use skipif instead of cmake
* test_cuda_graph_partial_graph_static_run
* rm stream_safe_cuda_alloc_test
PR types
New features
PR changes
APIs
Description
Launching CUDA kernels with the CUDAGraphNodeLauncher

To better handle CUDA kernels that use random seeds inside a CUDA Graph, we're introducing the CUDAGraphNodeLauncher. This mechanism redesigns the older PD_RECORD_CUDA_GRAPH_RANDOM_KERNEL technique: it uses the cuGraphExecKernelNodeSetParams method to set a fresh random seed for every CUDA kernel in the instantiated graph.

A special aspect of this mechanism is that the first parameter of any kernel launched through it must be an unsigned int id. This id ties the CUDA kernel to its node in the CUDA graph; by giving each launch its own unique id, every kernel stays correctly linked to the node it produced during capture.
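A minimal sketch of that calling convention follows; the names are illustrative and this is not the PR's actual launcher interface. The kernel takes the unsigned int id as its first parameter, and after capture the launcher can use that id to locate the kernel's node and rewrite its seed argument on the instantiated graph via cuGraphExecKernelNodeSetParams before each replay.

#include <cstdint>
#include <cuda_runtime.h>

// First parameter is the unsigned int id that ties this launch to its node in
// the captured CUDA graph; `seed` is the argument the launcher later rewrites
// on the instantiated cudaGraphExec_t before each replay.
__global__ void random_fill(unsigned int id, uint64_t seed, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // Stand-in for a real RNG: derive a value from the per-replay seed.
    out[i] = static_cast<float>((seed + i + id) & 0xFFFF) / 65536.0f;
  }
}

// Launch callback used while the graph is being captured; the launcher supplies
// a unique id so that same id appears as the kernel's first argument.
void LaunchRandomFill(unsigned int id, uint64_t seed, float *out, int n,
                      cudaStream_t stream) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  random_fill<<<blocks, threads, 0, stream>>>(id, seed, out, n);
}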
Previously, with the PD_RECORD_CUDA_GRAPH_RANDOM_KERNEL method, parameter comparisons were performed bitwise, as can be observed here. This style of comparison exhibited two primary shortcomings:

Inconsistent Parameter Capture: if a node has certain dynamic parameters, these might not always be captured correctly. A case in point is this instance, where a templated callable functor serves as a parameter; in such scenarios the bitwise comparison fails.

Lack of Distinction for Identical Launches: the bitwise comparison cannot distinguish kernel launches that were issued multiple times with identical parameters.
Modular CUDA Graph Layer: CUDAGraphedLayer

To speed up computation that repeats often, we've added the CUDAGraphedLayer. It converts a standard PaddlePaddle model into one that runs under CUDA Graphs for better performance. CUDAGraphedLayer wraps around PaddlePaddle models (nn.Layers), so existing models can benefit from the speedups CUDA Graphs bring without any extra work.
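For orientation, here is a minimal usage sketch. The import path is an assumption (this PR adds the cuda_graphed_layer module, but its exact location in the package may differ), and the model is just an example.

import paddle
# Assumed import location for the new layer; adjust to wherever the
# cuda_graphed_layer module lands in your Paddle installation.
from paddle.device.cuda.graphs import CUDAGraphedLayer

class MLP(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc1 = paddle.nn.Linear(128, 128)
        self.fc2 = paddle.nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(paddle.nn.functional.relu(self.fc1(x)))

# Requires a CUDA-enabled Paddle build; the wrapper captures the layer's
# computation into a CUDA graph and replays it on subsequent calls.
model = CUDAGraphedLayer(MLP())
x = paddle.randn([32, 128])
y = model(x)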