Update to CUDA 12.5. #332
I'm getting an error:
|
@bdice there are a number of reasons you could be seeing this, none of which we can or are going to change. I recommend installing the latest driver. |
@trxcllnt This is on a lab machine where I cannot control the driver. CI and lab machines are only supposed to use LTS or Production Branch drivers, which do not yet support 12.5. We won’t be able to run 12.5 devcontainers in CI (on GPU nodes, at least) or on lab machines. |
I thought the discussion we had in Slack concluded that we should not need driver updates to use 12.5 because we use LTS / PB drivers. xref: rapidsai/build-planning#73 (comment) |
Which machine are you seeing this on? I just ran |
I was on dgx05. I will try the command you gave. Maybe it’s something in how I invoked the devcontainer. |
Command: Error log
|
@trxcllnt Also, can you help me debug the CI failures? I don't know what is going wrong. The pip container fails to find |
That looks to be failing w/ the conda container? We don't even install the CTK in the conda container, it's basically just Ubuntu + miniforge. My guess is the nvidia-container-toolkit is seeing the Does it succeed if you run with |
The conda container is failing to create an env at all because dfg generated empty yaml files:
|
Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed. |
Ah. I think this job should fail earlier and show the error logs from dfg. CUDA 12.5 doesn't have entries in |
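The "fail earlier" idea in this comment could be sketched roughly as below: check that every generated environment file is non-empty before trying to create the env. This is only an illustration. The function names `find_empty_files` and `assert_nonempty_yaml` are hypothetical, and nothing here reflects how dfg or the CI job is actually implemented.

```python
from pathlib import Path


def find_empty_files(paths):
    """Return the paths that are missing or contain only whitespace."""
    empty = []
    for p in map(Path, paths):
        if not p.is_file() or not p.read_text().strip():
            empty.append(str(p))
    return empty


def assert_nonempty_yaml(paths):
    """Raise early, naming every generated file that came out empty."""
    empty = find_empty_files(paths)
    if empty:
        raise RuntimeError("generated env files are empty: " + ", ".join(empty))
```

Raising with the list of offending files would surface the real problem (no output for this CUDA version) instead of a later, less obvious env-creation failure.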
No, I get the same error when I run |
I updated this in d4ef78e. I wasn't sure if we wanted to keep libcudnn8 for any CUDA versions or not. If so, let me know. |
Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable? |
It looks like cuDNN 9.2.0 is compatible with 11.8 and 12.0-12.5, which would cover all the devcontainers we produce. https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#support-matrix |
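As a rough sketch of the compatibility claim above, the check could be written as a small predicate. The supported ranges (11.8 and 12.0 through 12.5) are taken from this comment only and should be confirmed against the linked support matrix; the function name `cudnn_9_2_supports` is hypothetical.

```python
def cudnn_9_2_supports(cuda_version: str) -> bool:
    """Return True if the CUDA version falls in the ranges quoted in this thread.

    Assumption: cuDNN 9.2.0 supports CUDA 11.8 and CUDA 12.0 through 12.5.
    """
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    if major == 11:
        return minor == 8
    if major == 12:
        return 0 <= minor <= 5
    return False
```

For example, `cudnn_9_2_supports("12.5")` and `cudnn_9_2_supports("11.8")` are both true under this assumption, while `cudnn_9_2_supports("11.4")` is not.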
Yes, but not every library works with cuDNN v9 yet (cupy, for example), so we need a variable to allow installing different versions. |
@trxcllnt I'm not sure how to add a variable. Is this something I modify in |
/ok to test |
cuDNN v9 isn't getting installed because they changed the names of the packages between 8 and 9. I'll push a commit that fixes it. |
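To illustrate the package rename mentioned above together with the idea of a cuDNN version input, here is a minimal sketch. It assumes the apt names are `libcudnn8` / `libcudnn8-dev` for v8 and `libcudnn9-cuda-<major>` / `libcudnn9-dev-cuda-<major>` for v9; these names are an assumption to verify against the NVIDIA apt repository, and `cudnn_apt_packages` is a hypothetical helper, not the code in the commit.

```python
def cudnn_apt_packages(cudnn_major: int, cuda_major: int) -> list[str]:
    """Build the runtime and dev package names for a requested cuDNN major version.

    Assumption: v8 packages carry no CUDA suffix, while v9 packages
    end in "-cuda-<CUDA major version>".
    """
    if cudnn_major == 8:
        return ["libcudnn8", "libcudnn8-dev"]
    if cudnn_major == 9:
        suffix = f"-cuda-{cuda_major}"
        return [f"libcudnn9{suffix}", f"libcudnn9-dev{suffix}"]
    raise ValueError(f"unsupported cuDNN major version: {cudnn_major}")
```

Taking the cuDNN major version as an input keeps v8 available for libraries that do not work with v9 yet, while letting newer images default to v9.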
@trxcllnt I had one question.
Do we need to install Seeing this on CI:
|
No, the problem is there are no matrix entries for CUDA 12.5 in |
This PR updates the CUDA default to 12.5 and also adds RAPIDS devcontainers for CUDA 12.5.
Part of rapidsai/build-planning#73.