
[CI] Don't run E2E tests on self-hosted CUDA in Nightly #14041

Merged: 1 commit into intel:sycl on Jun 6, 2024

Conversation

@aelovikov-intel (Contributor):

The runner seems to be broken; don't run the tests until it's fixed.

aelovikov-intel requested review from a team and jchlanda, and removed the request for a team, on June 4, 2024 at 22:05
aelovikov-intel requested a review from a team as a code owner on June 4, 2024 at 22:05
@aelovikov-intel (Contributor Author):

I believe someone from @intel/llvm-reviewers-cuda (maybe @npmiller ?) has access to the runner and I expect them to fix it and then revert this PR.

```diff
@@ -74,13 +74,6 @@ jobs:
  target_devices: opencl:cpu
  tests_selector: e2e

- name: Self-hosted CUDA
```
Contributor:

Instead of removing this code, can we just comment it out?

@aelovikov-intel (Contributor Author):

I think it's better to remove it. If the runner is unrecoverable, or nobody is willing to fix it, then there is no reason to keep the "dead", commented-out code in the repo.
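
For context, here is a minimal sketch of the kind of nightly-matrix entry the diff above removes, modeled on the surviving opencl:cpu lines; apart from the visible `name: Self-hosted CUDA`, the field values are illustrative assumptions rather than the repository's actual configuration:

```yaml
# Hypothetical reconstruction of the removed job-matrix entry. Only the
# "name" line is visible in the diff hunk above; the other values mirror
# the neighbouring opencl:cpu entry and are assumptions for illustration.
- name: Self-hosted CUDA
  target_devices: cuda:gpu   # assumption: device selector for the CUDA backend
  tests_selector: e2e        # same E2E selector as the opencl:cpu entry
```

Commenting the block out, as suggested in the review comment, would amount to prefixing these lines with `#`; the counter-argument above is that such a dead block is only worth keeping if someone actually intends to restore it.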

@uditagarwal97 (Contributor):

The Linux kernel and headers were updated on the CUDA runner (I don't know how), which caused the Nvidia driver to fail. I got the following error when trying to install the Nvidia driver for CUDA 12.1: https://forums.developer.nvidia.com/t/linux-6-7-3-545-29-06-550-40-07-error-modpost-gpl-incompatible-module-nvidia-ko-uses-gpl-only-symbol-rcu-read-lock/280908

Instead, as an experiment, I tried installing the CUDA 12.4 libraries and the recommended driver, and it seems to work fine: https://github.com/intel/llvm/actions/runs/9360554942/job/25813680144 (except for the known E2E failure: #13661).

I'll let @npmiller decide if we can keep CUDA 12.4 on the CI. If yes, someone needs to update the docker script (https://github.com/intel/llvm/blob/sycl/devops/containers/ubuntu2204_build.Dockerfile#L1) and disable the failing E2E test.

@jchlanda (Contributor) left a comment:

The diff looks good, but I can't speak to the rationale of the removal. @npmiller, would you like to have a look at this?

@JackAKirk (Contributor) commented on Jun 5, 2024:

> The diff looks good, but I can't speak to the rationale of the removal. @npmiller, would you like to have a look at this?

Nvidia released the 12.5 dev docker image 5 days ago. I'm trying to build it locally now; if that succeeds, we can go straight to 12.5. I've already checked that 12.5 passes all E2E tests, and with the updated driver they should have on the docker image, #13661 is fixed.

@npmiller (Contributor) commented on Jun 5, 2024:

> I believe someone from @intel/llvm-reviewers-cuda (maybe @npmiller?) has access to the runner and I expect them to fix it and then revert this PR.

I don't believe any of us have access to the runners, so I don't think we can fix them or investigate, unfortunately.

Thanks @uditagarwal97 for having a look. I think the 12.1 docker image should be able to run fine with the 12.4 driver that's on the runner, so if upgrading the runner's driver solves the issues you were seeing, it should all be good even without updating the docker image.

@JackAKirk (Contributor):

> > I believe someone from @intel/llvm-reviewers-cuda (maybe @npmiller?) has access to the runner and I expect them to fix it and then revert this PR.
>
> I don't believe any of us have access to the runners, so I don't think we can fix them or investigate, unfortunately.
>
> Thanks @uditagarwal97 for having a look. I think the 12.1 docker image should be able to run fine with the 12.4 driver that's on the runner, so if upgrading the runner's driver solves the issues you were seeing, it should all be good even without updating the docker image.

Testing on a runner with a 12.4 driver will result in the test failure described here: #13661 (comment)
I recommend using the 12.5 driver: 555.42.02.

@aelovikov-intel (Contributor Author):

> I don't believe any of us have access to the runners, so I don't think we can fix them or investigate, unfortunately.

Can you please get to the bottom of this, so that it is Codeplay maintaining the runner and not @uditagarwal97?

aelovikov-intel deleted the no-self-cuda branch on June 5, 2024 at 17:38
aelovikov-intel restored the no-self-cuda branch on June 6, 2024 at 17:20
@aelovikov-intel (Contributor Author):

The latest nightly failed again due to infrastructure issues with the runner. Resurrecting this PR to remove the faulty tasks until the Codeplay folks get access to the runner and assume ownership of that part of the CI.

aelovikov-intel merged commit e51a90a into intel:sycl on Jun 6, 2024
5 checks passed
aelovikov-intel deleted the no-self-cuda branch on June 6, 2024 at 22:34
ianayl pushed a commit to ianayl/sycl that referenced this pull request on Jun 13, 2024:
The runner seems to be broken, don't run the tests until it's fixed.