
Update to CUDA 12.5. #332

Merged: 11 commits into rapidsai:branch-24.08 on Jul 17, 2024
Conversation

bdice (Contributor) commented Jun 26, 2024

This PR updates the CUDA default to 12.5 and also adds RAPIDS devcontainers for CUDA 12.5.

Part of rapidsai/build-planning#73.

bdice (Contributor, Author) commented Jun 26, 2024

I'm getting an error: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown. I am on driver 535, which is an LTS branch, so I thought we wouldn't have any trouble. @trxcllnt Do you have insight on this?

docker run -it nvidia/cuda:12.5.0-base-ubuntu22.04 works fine on this system with driver 535, so I think it is an issue with how our devcontainers are built.

trxcllnt (Collaborator) commented Jun 26, 2024

@bdice there are a number of reasons you could be seeing this, none of which we can or are going to change. I recommend installing the latest driver.

bdice (Contributor, Author) commented Jun 27, 2024

@trxcllnt This is on a lab machine where I cannot control the driver. CI and lab machines are only supposed to use LTS or Production Branch drivers, which do not yet support 12.5. We won’t be able to run 12.5 devcontainers in CI (on GPU nodes, at least) or on lab machines.

bdice (Contributor, Author) commented Jun 27, 2024

I thought the discussion we had in Slack concluded that we should not need driver updates to use 12.5 because we use LTS / PB drivers. xref: rapidsai/build-planning#73 (comment)

trxcllnt (Collaborator) commented Jun 27, 2024

Which machine are you seeing this on? I just ran docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi on dgx01 w/ 535.161.08 and it worked fine.

bdice (Contributor, Author) commented Jun 28, 2024

I was on dgx05. I will try the command you gave. Maybe it’s something in how I invoked the devcontainer.

bdice (Contributor, Author) commented Jul 1, 2024

docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi works on dgx05 for me. Hmm. Here is the full error log I get when I try to launch the devcontainer on dgx05:

Command: devcontainer up --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder .

Error log:
[2024-07-01T21:57:00.173Z] @devcontainers/cli 0.54.2. Node.js v18.15.0. linux 5.4.0-182-generic x64.
[2024-07-01T21:57:00.278Z] Running the initializeCommand from devcontainer.json...

[2024-07-01T21:57:00.278Z] Start: Run: /bin/bash -c mkdir -m 0755 -p /raid/bdice/compose-environments/rapids1/devcontainers/../.{aws,cache,config,conda/pkgs,conda/devcontainers-cuda12.5-envs,log/devcontainer-utils} /raid/bdice/compose-environments/rapids1/devcontainers/../{rmm,kvikio,ucxx,cudf,raft,cuvs,cumlprims_mg,cuml,cugraph-ops,wholegraph,cugraph,cuspatial}
[2024-07-01T21:57:00.283Z] 
[2024-07-01T21:57:01.403Z] Resolving Feature dependencies for './features/src/utils'...
[2024-07-01T21:57:01.405Z] Resolving Feature dependencies for './features/src/rapids-build-utils'...
[2024-07-01T21:57:01.472Z] Start: Run: docker buildx build --load --build-arg BUILDKIT_INLINE_CACHE=1 -f /tmp/devcontainercli-bdice/container-features/0.54.2-1719871021400/Dockerfile-with-features -t vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51 --target dev_containers_target_stage --build-arg CUDA=12.5 --build-arg PYTHON_PACKAGE_MANAGER=conda --build-arg BASE=rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04 --build-context dev_containers_feature_content_source=/tmp/devcontainercli-bdice/container-features/0.54.2-1719871021400 --build-arg _DEV_CONTAINERS_BASE_IMAGE=dev_container_auto_added_stage_label --build-arg _DEV_CONTAINERS_IMAGE_USER=root --build-arg _DEV_CONTAINERS_FEATURE_CONTENT_SOURCE=dev_container_feature_content_temp /raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer
[2024-07-01T21:57:01.824Z] #0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile-with-features
#1 transferring dockerfile: 10.44kB done
#1 DONE 0.0s

#2 resolve image config for docker-image://docker.io/docker/dockerfile:1.5

[2024-07-01T21:57:01.967Z] #2 DONE 0.3s

[2024-07-01T21:57:02.077Z] 
#3 docker-image://docker.io/docker/dockerfile:1.5@sha256:39b85bbfa7536a5feceb7372a0817649ecb2724562a38360f4d6a7782a409b14
#3 CACHED

#4 [internal] load .dockerignore

[2024-07-01T21:57:02.077Z] #4 transferring context: 2B done
#4 DONE 0.0s

#5 [internal] load metadata for docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04

[2024-07-01T21:57:02.234Z] #5 ...

#6 [context dev_containers_feature_content_source] load .dockerignore
#6 transferring dev_containers_feature_content_source: 2B done
#6 DONE 0.0s

[2024-07-01T21:57:02.384Z] 
#5 [internal] load metadata for docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04

[2024-07-01T21:57:03.578Z] #5 DONE 1.5s

[2024-07-01T21:57:04.112Z] 
#7 [conda-base 1/1] FROM docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04@sha256:3817fe57e71da3e5667dbd860729dc5011324440e16e31a13c1b751cb71a2103
#7 DONE 0.0s

#8 [context dev_containers_feature_content_source] load from client
#8 transferring dev_containers_feature_content_source: 275.15kB 0.0s done
#8 DONE 0.0s

#9 [dev_containers_target_stage 2/5] COPY --from=dev_containers_feature_content_normalize /tmp/build-features/ /tmp/dev-container-features
#9 CACHED

[2024-07-01T21:57:04.112Z] 
#10 [dev_containers_feature_content_normalize 1/2] COPY --from=dev_containers_feature_content_source devcontainer-features.builtin.env /tmp/build-features/
#10 CACHED

#11 [dev_containers_feature_content_normalize 2/2] RUN chmod -R 0755 /tmp/build-features/
#11 CACHED

#12 [dev_containers_target_stage 4/5] RUN --mount=type=bind,from=dev_containers_feature_content_source,source=utils_0,target=/tmp/build-features-src/utils_0     cp -ar /tmp/build-features-src/utils_0 /tmp/dev-container-features  && chmod -R 0755 /tmp/dev-container-features/utils_0  && cd /tmp/dev-container-features/utils_0  && chmod +x ./devcontainer-features-install.sh  && ./devcontainer-features-install.sh  && rm -rf /tmp/dev-container-features/utils_0
#12 CACHED

#13 [dev_containers_target_stage 3/5] RUN echo "_CONTAINER_USER_HOME=$( (command -v getent >/dev/null 2>&1 && getent passwd 'root' || grep -E '^root|^[^:]*:[^:]*:root:' /etc/passwd || true) | cut -d: -f6)" >> /tmp/dev-container-features/devcontainer-features.builtin.env && echo "_REMOTE_USER_HOME=$( (command -v getent >/dev/null 2>&1 && getent passwd 'coder' || grep -E '^coder|^[^:]*:[^:]*:coder:' /etc/passwd || true) | cut -d: -f6)" >> /tmp/dev-container-features/devcontainer-features.builtin.env
#13 CACHED

#14 [dev_containers_target_stage 1/5] RUN mkdir -p /tmp/dev-container-features
#14 CACHED

#15 [dev_containers_target_stage 5/5] RUN --mount=type=bind,from=dev_containers_feature_content_source,source=rapids-build-utils_1,target=/tmp/build-features-src/rapids-build-utils_1     cp -ar /tmp/build-features-src/rapids-build-utils_1 /tmp/dev-container-features  && chmod -R 0755 /tmp/dev-container-features/rapids-build-utils_1  && cd /tmp/dev-container-features/rapids-build-utils_1  && chmod +x ./devcontainer-features-install.sh  && ./devcontainer-features-install.sh  && rm -rf /tmp/dev-container-features/rapids-build-utils_1
#15 CACHED

#16 exporting to image
#16 exporting layers done
#16 preparing layers for inline cache done
#16 writing image sha256:9f663f77db74298f79e8eb1a71e24b251aab14b89f590948d6a526ec1f2949f3 done
#16 naming to docker.io/library/vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51 done
#16 DONE 0.0s

[2024-07-01T21:57:07.334Z] Start: Run: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started
[2024-07-01T21:57:07.771Z] docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown.
Error: Command failed: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started
trap "exit 0" 15

exec "$@"
while sleep 1 & wait $!; do :; done -
    at J$ (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:462:1253)
    at $J (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:462:997)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async tAA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:479:3660)
    at async CC (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:479:4775)
    at async NeA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:612:11107)
    at async MeA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:612:10848)
{"outcome":"error","message":"Command failed: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started\ntrap \"exit 0\" 15\n\nexec \"$@\"\nwhile sleep 1 & wait $!; do :; done -","description":"An error occurred setting up the container."}

bdice requested a review from a team as a code owner (July 1, 2024, 22:01)
bdice requested review from AyodeAwe and removed the request for a team (July 1, 2024, 22:01)
bdice (Contributor, Author) commented Jul 1, 2024

@trxcllnt Also, can you help me debug the CI failures? I don't know what is going wrong. The pip container fails to find cudnn and the conda container fails to find gcc. I am going to update the branch to see if these issues recur.

trxcllnt (Collaborator) commented Jul 2, 2024

That looks to be failing with the conda container? We don't even install the CTK (CUDA Toolkit) in the conda container; it's basically just Ubuntu + miniforge.

My guess is the nvidia-container-toolkit is seeing the ENV CUDA_VERSION and inferring the NVIDIA_REQUIRE_CUDA constraints automatically.

Does it succeed if you run with --remote-env NVIDIA_DISABLE_REQUIRE=true?
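As an aside, one low-level way to test that theory outside the devcontainer CLI is to pass the variable straight to docker against the image from the earlier comment. This is a sketch of the workaround being suggested, not a command from the PR; it only constructs and prints the command, which you would then run on a host with docker and the nvidia-container-toolkit installed:

```shell
# Sketch: bypass nvidia-container-cli's CUDA version requirement check by
# setting NVIDIA_DISABLE_REQUIRE. Image tag taken from the earlier comment.
IMAGE=rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5
cmd="docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=true $IMAGE nvidia-smi"
echo "$cmd"  # run this on the host; requires docker + nvidia-container-toolkit
```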

trxcllnt (Collaborator) commented Jul 2, 2024

The conda container is failing to create an env at all because dfg (rapids-dependency-file-generator) generated empty yaml files:

  Not creating 'rapids' conda environment because 'rapids.yml' is empty.

trxcllnt (Collaborator) commented Jul 2, 2024

Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.

bdice (Contributor, Author) commented Jul 2, 2024

The conda container is failing to create an env at all because dfg generated empty yaml files:

Ah. I think this job should fail earlier and show the error logs from dfg. CUDA 12.5 doesn't have entries in dependencies.yaml for any RAPIDS repos yet. I had hoped to run CUDA 12.5 tests in unified devcontainers before opening PRs to every repo. Maybe I will start with the PRs to individual repos and come back to this repo later.
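The fail-earlier behavior suggested here could be sketched roughly as follows; the function name and file name are illustrative, not the actual devcontainer scripts:

```shell
# Sketch: abort with a clear message when the generated conda env file is
# empty, instead of silently skipping environment creation.
check_env_file() {
  local f="$1"
  if [ ! -s "$f" ]; then
    echo "ERROR: generated environment file '$f' is empty;" \
         "check dependencies.yaml for a matching CUDA matrix entry" >&2
    return 1
  fi
}
```

A caller would then do something like `check_env_file rapids.yml || exit 1` before invoking conda.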

bdice (Contributor, Author) commented Jul 2, 2024

Does it succeed if you run with --remote-env NVIDIA_DISABLE_REQUIRE=true?

No, I get the same error when I run devcontainer up --remote-env NVIDIA_DISABLE_REQUIRE=true --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder . as before.

bdice (Contributor, Author) commented Jul 2, 2024

Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.

I updated this in d4ef78e. I wasn't sure if we wanted to keep libcudnn8 for any CUDA versions or not. If so, let me know.

trxcllnt (Collaborator) commented Jul 2, 2024

Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?
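One way to expose that as a feature input is via the devcontainer feature spec's options block; the option name cudnnVersion and the proposals here are hypothetical, shown only to illustrate the mechanism:

```json
{
  "options": {
    "cudnnVersion": {
      "type": "string",
      "proposals": ["8", "9"],
      "default": "9",
      "description": "Major version of cuDNN to install alongside the CUDA toolkit"
    }
  }
}
```

Options declared this way are passed to the feature's install script as environment variables, so install.sh could branch on the requested version.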

bdice (Contributor, Author) commented Jul 2, 2024

Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?

It looks like cuDNN 9.2.0 is compatible with 11.8 and 12.0-12.5, which would cover all the devcontainers we produce. https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#support-matrix

trxcllnt (Collaborator) commented Jul 2, 2024

Yes but not every library works with cuDNN v9 yet (cupy, for example), so we need a variable to allow installing different versions.

bdice (Contributor, Author) commented Jul 2, 2024

@trxcllnt I'm not sure how to add a variable. Is this something I modify in matrix.yaml?

bdice (Contributor, Author) commented Jul 2, 2024

Maybe I got it right? I guessed. See deba81b and d8f91e9.

trxcllnt (Collaborator) commented Jul 2, 2024

/ok to test

trxcllnt (Collaborator) commented Jul 2, 2024

cuDNN v9 isn't getting installed because they changed the names of the packages between 8 and 9. I'll push a commit that fixes it.
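The renaming between the v8 and v9 apt packages (libcudnn8 vs. cudnn9-cuda-12) could be handled with a small helper along these lines; this is a sketch of the idea, not the actual commit:

```shell
# Sketch: map cuDNN major version + CUDA major version to the apt package name.
# cuDNN 8 shipped as libcudnn8; cuDNN 9 renamed to cudnn<major>-cuda-<cuda major>.
cudnn_pkg() {
  local cudnn_major="$1" cuda_major="$2"
  if [ "$cudnn_major" -ge 9 ]; then
    echo "cudnn${cudnn_major}-cuda-${cuda_major}"
  else
    echo "libcudnn${cudnn_major}"
  fi
}
```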

bdice (Contributor, Author) left a review comment:

@trxcllnt I had one question.

(review thread on features/src/cuda/install.sh, resolved)
jakirkham (Member) commented:

Do we need to install cxx-compiler somewhere and point CMake to it?

Seeing this on CI:

CMake Error at /usr/share/cmake-3.30/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
  Could not find compiler set in environment variable CXX:

  /usr/bin/g++.

Call Stack (most recent call first):
  CMakeLists.txt:24 (project)


CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!

trxcllnt (Collaborator) commented:

No, the problem is there are no matrix entries for CUDA 12.5 in dependencies.yaml (e.g. here), causing rapids-dependency-file-generator to output an empty conda environment yaml file, so nothing gets installed.
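For reference, the kind of entry rapids-dependency-file-generator consumes resembles the following; the structure follows the RAPIDS dependencies.yaml convention, but the exact key names and packages here are illustrative:

```yaml
dependencies:
  cuda_version:
    specific:
      - output_types: conda
        matrices:
          - matrix:
              cuda: "12.5"
            packages:
              - cuda-version=12.5
```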

trxcllnt merged commit 1bd1bd5 into rapidsai:branch-24.08 on Jul 17, 2024
211 of 212 checks passed