Crash in rmm async allocator in test_java_url_decode #349

Closed
nvdbaranec opened this issue Jan 22, 2024 · 6 comments

@nvdbaranec

Build failure in CI/CD. The root cause is likely upstream of the async allocator itself.

https://prod.blsm.nvidia.com/sw-gpu-spark-jenkins/job/examples-udf-examples-native/121/

src/main/python/rapids_udf_test.py::test_java_url_decode

[2024-01-22T09:44:17.989Z] 24/01/22 09:44:17 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-652-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:128: cudaErrorIllegalAddress an illegal memory access was encountered
@sameerz sameerz assigned jlowe and unassigned GaryShen2008 Jan 23, 2024
@sameerz sameerz added bug Something isn't working and removed ? - Needs Triage labels Jan 23, 2024
@jlowe
Member

jlowe commented Jan 24, 2024

@GaryShen2008 at first I thought this was related to #347, but upon digging I discovered that change is not in the build that failed. #347 was accidentally committed to main instead of branch-24.02. 🤦

I'll put up reverts to main and re-PR to branch-24.02 to fix that mistake, but the failure noted in this issue may be related to the ccache change in #345, as that's the only change listed between the good run and the failing run.

@jlowe jlowe assigned GaryShen2008 and unassigned jlowe Jan 24, 2024
@GaryShen2008
Collaborator

I could not reproduce this locally. Triggering CI again.

@GaryShen2008
Collaborator

@NvTimLiu CI can reproduce the issue.
Please help figure it out.

@NvTimLiu
Collaborator

NvTimLiu commented Feb 5, 2024

checking

@NvTimLiu
Collaborator

NvTimLiu commented Feb 5, 2024

The branch-24.04 UDF native examples app is still building its native cuDF against branch-23.12:

https://github.com/NVIDIA/spark-rapids-examples/blob/branch-24.04/examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt#L87-L90

This caused the failure: the latest branch-24.04 rapidsai cuDF/rmm changes (e.g. rapidsai/rmm#1437) were not built into the UDF native examples jar file.
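
A stale pin like this could be caught before it reaches CI with a small check. The following is only a minimal sketch: it assumes the cuDF dependency is pinned through a `GIT_TAG` line in `CMakeLists.txt`, and the CMake fragment, the `find_stale_pins` helper, and the expected branch name are all hypothetical illustrations, not content from the linked file.

```python
import re

# Hypothetical CMake fragment for illustration only; the real
# CMakeLists.txt lines referenced above are not reproduced in this thread.
SAMPLE_CMAKE = """
CPMAddPackage(NAME cudf
        GIT_REPOSITORY  https://github.com/rapidsai/cudf.git
        GIT_TAG         branch-23.12
)
"""

# Matches lines of the form "GIT_TAG <value>" (assumed pin syntax).
GIT_TAG_RE = re.compile(r"^\s*GIT_TAG\s+(\S+)", re.MULTILINE)


def find_stale_pins(cmake_text, expected_branch):
    """Return every GIT_TAG value that differs from expected_branch."""
    return [tag for tag in GIT_TAG_RE.findall(cmake_text)
            if tag != expected_branch]


if __name__ == "__main__":
    # A non-empty list means the native build targets a different branch
    # than the one the examples are released against.
    print(find_stale_pins(SAMPLE_CMAKE, "branch-24.04"))
```

Running a check of this kind as part of a branch bump would flag the mismatch at review time rather than as a runtime crash.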

NvTimLiu added a commit to NvTimLiu/spark-rapids-examples that referenced this issue Feb 5, 2024
1, update rapidsai cudf/rmm to branch-24.04, to fix:
    issue: NVIDIA#349

2, Change Spark-cuML and rapids-UDF example apps' version to 24.04.0-SNAPSHOT

3, Fix typo in Spark-cuSpatial

Signed-off-by: Tim Liu <timl@nvidia.com>
NvTimLiu added a commit to NvTimLiu/spark-rapids-examples that referenced this issue Feb 5, 2024
Update rapidsai cudf/rmm to branch-24.02, to fix:

issue: NVIDIA#349

Signed-off-by: Tim Liu <timl@nvidia.com>
NvTimLiu added a commit to NvTimLiu/spark-rapids-examples that referenced this issue Feb 5, 2024
Update rapidsai cudf/rmm to branch-24.02 for spark rapids UDF native example app, to fix:

issue: NVIDIA#349

Signed-off-by: Tim Liu <timl@nvidia.com>
jlowe pushed a commit that referenced this issue Feb 5, 2024
Update rapidsai cudf/rmm to branch-24.02 for spark rapids UDF native example app, to fix:

issue: #349

Signed-off-by: Tim Liu <timl@nvidia.com>
NvTimLiu added a commit that referenced this issue Feb 6, 2024
1, update rapidsai cudf/rmm to branch-24.04, to fix:
    issue: #349

2, Change Spark-cuML and rapids-UDF example apps' version to 24.04.0-SNAPSHOT

3, Fix typo in Spark-cuSpatial

Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu
Collaborator

NvTimLiu commented Feb 8, 2024

Closing, as the fixes #358 and #359 have been merged.

@NvTimLiu NvTimLiu closed this as completed Feb 8, 2024