
[CI] Don't run E2E tests on self-hosted CUDA in Nightly #14041

Merged: 1 commit into intel:sycl on Jun 6, 2024

Conversation

@aelovikov-intel (Contributor):

The runner seems to be broken; don't run the tests until it's fixed.

aelovikov-intel requested review from a team and jchlanda, and removed the request for a team, on June 4, 2024 at 22:05
aelovikov-intel requested a review from a team as a code owner on June 4, 2024 at 22:05
@aelovikov-intel (Contributor Author):

I believe someone from @intel/llvm-reviewers-cuda (maybe @npmiller ?) has access to the runner and I expect them to fix it and then revert this PR.

```diff
@@ -74,13 +74,6 @@ jobs:
  target_devices: opencl:cpu
  tests_selector: e2e

- name: Self-hosted CUDA
```
Contributor:

Instead of removing this code, can we just comment it out?

@aelovikov-intel (Contributor Author):

I think it's better to remove it. If the runner is unrecoverable, or nobody is willing to fix it, then there is no reason to keep the "dead", commented-out code in the repo.
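
For context, here is a minimal sketch of the kind of nightly-matrix entry the diff above removes, modeled on the surviving opencl:cpu lines; apart from the visible `name: Self-hosted CUDA`, the field values are illustrative assumptions rather than the repository's actual configuration:

```yaml
# Hypothetical reconstruction of the removed job-matrix entry. Only the
# "name" line is visible in the diff hunk above; the other values mirror
# the neighbouring opencl:cpu entry and are assumptions for illustration.
- name: Self-hosted CUDA
  target_devices: cuda:gpu   # assumption: device selector for the CUDA backend
  tests_selector: e2e        # same E2E selector as the opencl:cpu entry
```

Commenting the block out, as suggested in the review comment, would amount to prefixing these lines with `#`; the counter-argument above is that such a dead block is only worth keeping if someone actually intends to restore it.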

@uditagarwal97 (Contributor):

The Linux kernel and headers were updated on the CUDA runner (I don't know how), which caused the Nvidia driver to fail. I got the following error when trying to install the Nvidia driver for CUDA 12.1: https://forums.developer.nvidia.com/t/linux-6-7-3-545-29-06-550-40-07-error-modpost-gpl-incompatible-module-nvidia-ko-uses-gpl-only-symbol-rcu-read-lock/280908

Instead, as an experiment, I tried installing the CUDA 12.4 libraries and the recommended driver, and it seems to work fine: https://github.com/intel/llvm/actions/runs/9360554942/job/25813680144 (except for the known E2E failure: #13661).

I'll let @npmiller decide if we can keep CUDA 12.4 on the CI. If yes, someone needs to update the docker script (https://github.com/intel/llvm/blob/sycl/devops/containers/ubuntu2204_build.Dockerfile#L1) and disable the failing E2E test.

@jchlanda (Contributor) left a comment:

The diff looks good, but I can't speak to the rationale of the removal. @npmiller, would you like to have a look at this?

@JackAKirk (Contributor) commented on Jun 5, 2024:

> The diff looks good, but I can't speak to the rationale of the removal. @npmiller, would you like to have a look at this?

Nvidia released the 12.5 dev docker image 5 days ago. I'm trying to build it locally now; if that succeeds, we can go straight to 12.5. I've already checked that 12.5 passes all E2E tests, and with the updated driver they should have on the docker image, #13661 is fixed.

@npmiller (Contributor) commented on Jun 5, 2024:

> I believe someone from @intel/llvm-reviewers-cuda (maybe @npmiller?) has access to the runner and I expect them to fix it and then revert this PR.

I don't believe any of us have access to the runners, so I don't think we can fix them or investigate, unfortunately.

Thanks @uditagarwal97 for having a look. I think the 12.1 docker image should be able to run fine with the 12.4 driver that's on the runner, so if upgrading the runner's driver solves the issues you were seeing, it should all be good even without updating the docker image.

@JackAKirk (Contributor):

> > I believe someone from @intel/llvm-reviewers-cuda (maybe @npmiller?) has access to the runner and I expect them to fix it and then revert this PR.
>
> I don't believe any of us have access to the runners, so I don't think we can fix them or investigate, unfortunately.
>
> Thanks @uditagarwal97 for having a look. I think the 12.1 docker image should be able to run fine with the 12.4 driver that's on the runner, so if upgrading the runner's driver solves the issues you were seeing, it should all be good even without updating the docker image.

Testing on a runner with a 12.4 driver will result in the test failure described here: #13661 (comment)
I recommend using the 12.5 driver: 555.42.02.

@aelovikov-intel (Contributor Author):

> I don't believe any of us have access to the runners, so I don't think we can fix them or investigate, unfortunately.

Can you please get to the bottom of this, so that it is Codeplay maintaining the runner and not @uditagarwal97?

aelovikov-intel deleted the no-self-cuda branch on June 5, 2024 at 17:38
aelovikov-intel restored the no-self-cuda branch on June 6, 2024 at 17:20
@aelovikov-intel (Contributor Author):

The latest nightly failed again due to infrastructure issues with the runner. Resurrecting this PR to remove the faulty tasks until the Codeplay folks get access to the runner and assume ownership of that part of the CI.

aelovikov-intel merged commit e51a90a into intel:sycl on Jun 6, 2024
5 checks passed
aelovikov-intel deleted the no-self-cuda branch on June 6, 2024 at 22:34
ianayl pushed a commit to ianayl/sycl that referenced this pull request on Jun 13, 2024:
The runner seems to be broken, don't run the tests until it's fixed.