[BUG] cudf::binary_operation ignores cuda context when registering JIT compiled PTX #5133
Comments
Any idea how the calling thread's context is being ignored here? Is this a case where a thread is being created without an explicit context and CUDA is auto-selecting a (potentially incorrect) device when it implicitly initializes the context? If so that will cause problems in Spark with the RAPIDS plugins in a multi-GPU setup where the GPU device is assigned at the application level (not through |
This is happening in the JIT code, where the compiled kernel is registered with only one context. A subsequent call from a different context then fails. It would affect cases where different threads are assigned different devices, as is the case with @magnatelee's usage. If Spark uses one libcudf process per GPU then this won't affect it. If it uses one thread per GPU then it will. |
I'm investigating a fix such that the in-memory cache is stored per context. |
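To make the "cache per context" idea concrete, here is a minimal sketch of an in-memory cache keyed on both the context and the kernel name. It is not the jitify or libcudf implementation. `ContextId`, `LoadedKernel`, and `PerContextKernelCache` are hypothetical names, and `ContextId` is a plain integer standing in for a `CUcontext` so the sketch compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a CUcontext handle (assumption, not a CUDA type).
using ContextId = std::uintptr_t;

// Hypothetical stand-in for a module/function loaded into one context.
struct LoadedKernel {
  ContextId context;  // context the kernel was loaded into
  std::string name;   // kernel name
};

class PerContextKernelCache {
 public:
  // Returns the kernel cached for (ctx, name). On a miss, calls
  // load(ctx, name) once and stores the result under that key, so a
  // kernel loaded for one context is never handed to another context.
  template <typename Loader>
  std::shared_ptr<LoadedKernel> get_or_load(ContextId ctx,
                                            const std::string& name,
                                            Loader load) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto key = std::make_pair(ctx, name);
    auto it = cache_.find(key);
    if (it != cache_.end()) { return it->second; }
    auto kernel = std::make_shared<LoadedKernel>(load(ctx, name));
    cache_.emplace(key, kernel);
    return kernel;
  }

  // Number of (context, name) entries currently cached.
  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::pair<ContextId, std::string>, std::shared_ptr<LoadedKernel>> cache_;
};
```

With a cache keyed only on the kernel name, the second thread in a one-thread-per-GPU setup would get back the entry registered by the first thread's context. Including the context in the key forces a separate load per context. |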
Ah, great to hear. The Spark RAPIDS plugin currently only uses one GPU per process. |
So what is left in this bug, @devavret ? |
The issue also asks for a check
I implemented this in NVIDIA/jitify#67 and after that the |
@devavret @magnatelee is this still an issue? |
Not a lot is left. Just needs to replace the Line 97 in 2780a8c
safe_launch . I'll make a quick PR tomorrow.
|
Hi! How is this going? |
I had a branch for it but I can't find it anymore. Must be lost in my corrupted git. Here's a new one #7510 |
Final step, closes rapidsai#5133 Authors: - Devavret Makkar (@devavret) Approvers: - Nghia Truong (@ttnghia) - Vukasin Milovanovic (@vuule) URL: rapidsai#7510
Describe the bug
cudf::binary_operation currently ignores the CUDA context of the caller thread, which causes the JIT compiled PTX to be loaded on the wrong device. Even worse, cudf::binary_operation does not check the CUresult from the kernel launch, so the error is silently ignored and is noticed only with cuda-memcheck.
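As a rough illustration of the missing check, the sketch below turns a failed launch status into an exception instead of dropping it. It is only a sketch under stated assumptions: `LaunchResult` is a hypothetical stand-in for the driver API's `CUresult` so the code compiles without CUDA, and `check_launch` is a hypothetical helper, not a cudf or jitify function.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for CUresult (assumption: real code would test
// the driver API status against CUDA_SUCCESS instead).
enum class LaunchResult { success = 0, invalid_context, launch_failed };

// Human-readable name for a status value.
inline const char* to_string(LaunchResult r) {
  switch (r) {
    case LaunchResult::success: return "success";
    case LaunchResult::invalid_context: return "invalid context";
    case LaunchResult::launch_failed: return "launch failed";
  }
  return "unknown";
}

// Throws std::runtime_error when the status is not success, so a failed
// launch is reported to the caller rather than silently ignored.
inline void check_launch(LaunchResult r, const char* what) {
  if (r != LaunchResult::success) {
    throw std::runtime_error(std::string(what) + " failed: " + to_string(r));
  }
}
```

Wrapping every launch in a check like this would have surfaced the wrong-context error at the call site instead of only under cuda-memcheck.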