
Update to CUDA 12.5. #332

Merged: 11 commits into rapidsai:branch-24.08 on Jul 17, 2024
Conversation

bdice (Contributor) commented Jun 26, 2024

This PR updates the CUDA default to 12.5 and also adds RAPIDS devcontainers for CUDA 12.5.

Part of rapidsai/build-planning#73.

bdice (Contributor, Author) commented Jun 26, 2024

I'm getting an error: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown. I am on driver 535, which is an LTS branch, so I thought we wouldn't have any trouble. @trxcllnt Do you have insight on this?

docker run -it nvidia/cuda:12.5.0-base-ubuntu22.04 works fine on this system with driver 535, so I think it is an issue with how our devcontainers are built.

trxcllnt (Collaborator) commented Jun 26, 2024

@bdice there are a number of reasons you could be seeing this, none of which we can or are going to change. I recommend installing the latest driver.

bdice (Contributor, Author) commented Jun 27, 2024

@trxcllnt This is on a lab machine where I cannot control the driver. CI and lab machines are only supposed to use LTS or Production Branch drivers, which do not yet support 12.5. We won’t be able to run 12.5 devcontainers in CI (on GPU nodes, at least) or on lab machines.

bdice (Contributor, Author) commented Jun 27, 2024

I thought the discussion we had in Slack concluded that we should not need driver updates to use 12.5 because we use LTS / PB drivers. xref: rapidsai/build-planning#73 (comment)

trxcllnt (Collaborator) commented Jun 27, 2024

Which machine are you seeing this on? I just ran docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi on dgx01 w/ 535.161.08 and it worked fine.

bdice (Contributor, Author) commented Jun 28, 2024

I was on dgx05. I will try the command you gave. Maybe it’s something in how I invoked the devcontainer.

bdice (Contributor, Author) commented Jul 1, 2024

docker run --rm --gpus all rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5 nvidia-smi works on dgx05 for me. Hmm. Here is the full error log I get when I try to launch the devcontainer on dgx05:

Command: devcontainer up --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder .

Error log:
[2024-07-01T21:57:00.173Z] @devcontainers/cli 0.54.2. Node.js v18.15.0. linux 5.4.0-182-generic x64.
[2024-07-01T21:57:00.278Z] Running the initializeCommand from devcontainer.json...

[2024-07-01T21:57:00.278Z] Start: Run: /bin/bash -c mkdir -m 0755 -p /raid/bdice/compose-environments/rapids1/devcontainers/../.{aws,cache,config,conda/pkgs,conda/devcontainers-cuda12.5-envs,log/devcontainer-utils} /raid/bdice/compose-environments/rapids1/devcontainers/../{rmm,kvikio,ucxx,cudf,raft,cuvs,cumlprims_mg,cuml,cugraph-ops,wholegraph,cugraph,cuspatial}
[2024-07-01T21:57:00.283Z] 
[2024-07-01T21:57:01.403Z] Resolving Feature dependencies for './features/src/utils'...
[2024-07-01T21:57:01.405Z] Resolving Feature dependencies for './features/src/rapids-build-utils'...
[2024-07-01T21:57:01.472Z] Start: Run: docker buildx build --load --build-arg BUILDKIT_INLINE_CACHE=1 -f /tmp/devcontainercli-bdice/container-features/0.54.2-1719871021400/Dockerfile-with-features -t vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51 --target dev_containers_target_stage --build-arg CUDA=12.5 --build-arg PYTHON_PACKAGE_MANAGER=conda --build-arg BASE=rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04 --build-context dev_containers_feature_content_source=/tmp/devcontainercli-bdice/container-features/0.54.2-1719871021400 --build-arg _DEV_CONTAINERS_BASE_IMAGE=dev_container_auto_added_stage_label --build-arg _DEV_CONTAINERS_IMAGE_USER=root --build-arg _DEV_CONTAINERS_FEATURE_CONTENT_SOURCE=dev_container_feature_content_temp /raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer
[2024-07-01T21:57:01.824Z] #0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile-with-features
#1 transferring dockerfile: 10.44kB done
#1 DONE 0.0s

#2 resolve image config for docker-image://docker.io/docker/dockerfile:1.5

[2024-07-01T21:57:01.967Z] #2 DONE 0.3s

[2024-07-01T21:57:02.077Z] 
#3 docker-image://docker.io/docker/dockerfile:1.5@sha256:39b85bbfa7536a5feceb7372a0817649ecb2724562a38360f4d6a7782a409b14
#3 CACHED

#4 [internal] load .dockerignore

[2024-07-01T21:57:02.077Z] #4 transferring context: 2B done
#4 DONE 0.0s

#5 [internal] load metadata for docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04

[2024-07-01T21:57:02.234Z] #5 ...

#6 [context dev_containers_feature_content_source] load .dockerignore
#6 transferring dev_containers_feature_content_source: 2B done
#6 DONE 0.0s

[2024-07-01T21:57:02.384Z] 
#5 [internal] load metadata for docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04

[2024-07-01T21:57:03.578Z] #5 DONE 1.5s

[2024-07-01T21:57:04.112Z] 
#7 [conda-base 1/1] FROM docker.io/rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04@sha256:3817fe57e71da3e5667dbd860729dc5011324440e16e31a13c1b751cb71a2103
#7 DONE 0.0s

#8 [context dev_containers_feature_content_source] load from client
#8 transferring dev_containers_feature_content_source: 275.15kB 0.0s done
#8 DONE 0.0s

#9 [dev_containers_target_stage 2/5] COPY --from=dev_containers_feature_content_normalize /tmp/build-features/ /tmp/dev-container-features
#9 CACHED

[2024-07-01T21:57:04.112Z] 
#10 [dev_containers_feature_content_normalize 1/2] COPY --from=dev_containers_feature_content_source devcontainer-features.builtin.env /tmp/build-features/
#10 CACHED

#11 [dev_containers_feature_content_normalize 2/2] RUN chmod -R 0755 /tmp/build-features/
#11 CACHED

#12 [dev_containers_target_stage 4/5] RUN --mount=type=bind,from=dev_containers_feature_content_source,source=utils_0,target=/tmp/build-features-src/utils_0     cp -ar /tmp/build-features-src/utils_0 /tmp/dev-container-features  && chmod -R 0755 /tmp/dev-container-features/utils_0  && cd /tmp/dev-container-features/utils_0  && chmod +x ./devcontainer-features-install.sh  && ./devcontainer-features-install.sh  && rm -rf /tmp/dev-container-features/utils_0
#12 CACHED

#13 [dev_containers_target_stage 3/5] RUN echo "_CONTAINER_USER_HOME=$( (command -v getent >/dev/null 2>&1 && getent passwd 'root' || grep -E '^root|^[^:]*:[^:]*:root:' /etc/passwd || true) | cut -d: -f6)" >> /tmp/dev-container-features/devcontainer-features.builtin.env && echo "_REMOTE_USER_HOME=$( (command -v getent >/dev/null 2>&1 && getent passwd 'coder' || grep -E '^coder|^[^:]*:[^:]*:coder:' /etc/passwd || true) | cut -d: -f6)" >> /tmp/dev-container-features/devcontainer-features.builtin.env
#13 CACHED

#14 [dev_containers_target_stage 1/5] RUN mkdir -p /tmp/dev-container-features
#14 CACHED

#15 [dev_containers_target_stage 5/5] RUN --mount=type=bind,from=dev_containers_feature_content_source,source=rapids-build-utils_1,target=/tmp/build-features-src/rapids-build-utils_1     cp -ar /tmp/build-features-src/rapids-build-utils_1 /tmp/dev-container-features  && chmod -R 0755 /tmp/dev-container-features/rapids-build-utils_1  && cd /tmp/dev-container-features/rapids-build-utils_1  && chmod +x ./devcontainer-features-install.sh  && ./devcontainer-features-install.sh  && rm -rf /tmp/dev-container-features/rapids-build-utils_1
#15 CACHED

#16 exporting to image
#16 exporting layers done
#16 preparing layers for inline cache done
#16 writing image sha256:9f663f77db74298f79e8eb1a71e24b251aab14b89f590948d6a526ec1f2949f3 done
#16 naming to docker.io/library/vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51 done
#16 DONE 0.0s

[2024-07-01T21:57:07.334Z] Start: Run: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started
[2024-07-01T21:57:07.771Z] docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown.
Error: Command failed: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started
trap "exit 0" 15

exec "$@"
while sleep 1 & wait $!; do :; done -
    at J$ (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:462:1253)
    at $J (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:462:997)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async tAA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:479:3660)
    at async CC (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:479:4775)
    at async NeA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:612:11107)
    at async MeA (/home/nfs/bdice/mambaforge/envs/dice/lib/node_modules/@devcontainers/cli/dist/spec-node/devContainersSpecCLI.js:612:10848)
{"outcome":"error","message":"Command failed: docker run --sig-proxy=false -a STDOUT -a STDERR --mount source=/raid/bdice/compose-environments/rapids1/devcontainers,target=/home/coder/devcontainers,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../rmm,target=/home/coder/rmm,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cudf,target=/home/coder/cudf,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../raft,target=/home/coder/raft,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cumlprims_mg,target=/home/coder/cumlprims_mg,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuml,target=/home/coder/cuml,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph-ops,target=/home/coder/cugraph-ops,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent --mount 
source=/raid/bdice/compose-environments/rapids1/devcontainers/../.aws,target=/home/coder/.aws,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.cache,target=/home/coder/.cache,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.config,target=/home/coder/.config,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/pkgs,target=/home/coder/.conda/pkgs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.conda/devcontainers-cuda12.5-envs,target=/home/coder/.conda/envs,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/../.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/utils/opt/devcontainer/bin,target=/opt/devcontainer/bin,type=bind,consistency=consistent --mount source=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/features/src/rapids-build-utils/opt/rapids-build-utils,target=/opt/rapids-build-utils,type=bind,consistency=consistent -l devcontainer.local_folder=/raid/bdice/compose-environments/rapids1/devcontainers -l devcontainer.config_file=/raid/bdice/compose-environments/rapids1/devcontainers/.devcontainer/cuda12.5-conda/devcontainer.json -u root --rm --name bdice-rapids-devcontainers-24.08-cuda12.5-conda --gpus all --entrypoint /bin/sh vsc-devcontainers-6433542dccae9a9a0285fafc8ae4cf3cd36fd59a9575b19566d180ca37b5db51-uid -c echo Container started\ntrap \"exit 0\" 15\n\nexec \"$@\"\nwhile sleep 1 & wait $!; do :; done -","description":"An error occurred setting up the container."}

bdice requested a review from a team as a code owner (July 1, 2024, 22:01)
bdice requested review from AyodeAwe and removed the request for a team (July 1, 2024, 22:01)
bdice (Contributor, Author) commented Jul 1, 2024

@trxcllnt Also, can you help me debug the CI failures? I don't know what is going wrong. The pip container fails to find cudnn and the conda container fails to find gcc. I am going to update the branch to see if these issues recur.

trxcllnt (Collaborator) commented Jul 2, 2024

That looks to be failing with the conda container? We don't even install the CTK (CUDA Toolkit) in the conda container; it's basically just Ubuntu + miniforge.

My guess is the nvidia-container-toolkit is seeing the ENV CUDA_VERSION and inferring the NVIDIA_REQUIRE_CUDA constraints automatically.

Does it succeed if you run with --remote-env NVIDIA_DISABLE_REQUIRE=true?
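As an aside, one low-level way to test that theory outside the devcontainer CLI is to pass the variable straight to docker against the image from the earlier comment. This is a sketch of the workaround being suggested, not a command from the PR; it only constructs and prints the command, which you would then run on a host with docker and the nvidia-container-toolkit installed:

```shell
# Sketch: bypass nvidia-container-cli's CUDA version requirement check by
# setting NVIDIA_DISABLE_REQUIRE. Image tag taken from the earlier comment.
IMAGE=rapidsai/devcontainers:24.08-cpp-gcc13-cuda12.5
cmd="docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=true $IMAGE nvidia-smi"
echo "$cmd"  # run this on the host; requires docker + nvidia-container-toolkit
```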

trxcllnt (Collaborator) commented Jul 2, 2024

The conda container is failing to create an env at all because dfg (rapids-dependency-file-generator) generated empty yaml files:

  Not creating 'rapids' conda environment because 'rapids.yml' is empty.

trxcllnt (Collaborator) commented Jul 2, 2024

Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.

bdice (Contributor, Author) commented Jul 2, 2024

The conda container is failing to create an env at all because dfg generated empty yaml files:

Ah. I think this job should fail earlier and show the error logs from dfg. CUDA 12.5 doesn't have entries in dependencies.yaml for any RAPIDS repos yet. I had hoped to run CUDA 12.5 tests in unified devcontainers before opening PRs to every repo. Maybe I will start with the PRs to individual repos and come back to this repo later.
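The fail-earlier behavior suggested here could be sketched roughly as follows; the function name and file name are illustrative, not the actual devcontainer scripts:

```shell
# Sketch: abort with a clear message when the generated conda env file is
# empty, instead of silently skipping environment creation.
check_env_file() {
  local f="$1"
  if [ ! -s "$f" ]; then
    echo "ERROR: generated environment file '$f' is empty;" \
         "check dependencies.yaml for a matching CUDA matrix entry" >&2
    return 1
  fi
}
```

A caller would then do something like `check_env_file rapids.yml || exit 1` before invoking conda.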

bdice (Contributor, Author) commented Jul 2, 2024

Does it succeed if you run with --remote-env NVIDIA_DISABLE_REQUIRE=true?

No, I get the same error when I run devcontainer up --remote-env NVIDIA_DISABLE_REQUIRE=true --config .devcontainer/cuda12.5-conda/devcontainer.json --workspace-folder . as before.

bdice (Contributor, Author) commented Jul 2, 2024

Looks like the CUDA feature is trying to install cuDNN v8, but IIRC it's v9 now, so that's why cuDNN isn't getting installed.

I updated this in d4ef78e. I wasn't sure if we wanted to keep libcudnn8 for any CUDA versions or not. If so, let me know.

trxcllnt (Collaborator) commented Jul 2, 2024

Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?
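One way to expose that as a feature input is via the devcontainer feature spec's options block; the option name cudnnVersion and the proposals here are hypothetical, shown only to illustrate the mechanism:

```json
{
  "options": {
    "cudnnVersion": {
      "type": "string",
      "proposals": ["8", "9"],
      "default": "9",
      "description": "Major version of cuDNN to install alongside the CUDA toolkit"
    }
  }
}
```

Options declared this way are passed to the feature's install script as environment variables, so install.sh could branch on the requested version.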

bdice (Contributor, Author) commented Jul 2, 2024

Yeah we need to install the right cuDNN version based on the CUDA toolkit. Maybe we can make the cuDNN version a feature input variable?

It looks like cuDNN 9.2.0 is compatible with 11.8 and 12.0-12.5, which would cover all the devcontainers we produce. https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#support-matrix

trxcllnt (Collaborator) commented Jul 2, 2024

Yes but not every library works with cuDNN v9 yet (cupy, for example), so we need a variable to allow installing different versions.

bdice (Contributor, Author) commented Jul 2, 2024

@trxcllnt I'm not sure how to add a variable. Is this something I modify in matrix.yaml?

bdice (Contributor, Author) commented Jul 2, 2024

Maybe I got it right? I guessed. See deba81b and d8f91e9.

trxcllnt (Collaborator) commented Jul 2, 2024

/ok to test

trxcllnt (Collaborator) commented Jul 2, 2024

cuDNN v9 isn't getting installed because they changed the names of the packages between 8 and 9. I'll push a commit that fixes it.
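The renaming between the v8 and v9 apt packages (libcudnn8 vs. cudnn9-cuda-12) could be handled with a small helper along these lines; this is a sketch of the idea, not the actual commit:

```shell
# Sketch: map cuDNN major version + CUDA major version to the apt package name.
# cuDNN 8 shipped as libcudnn8; cuDNN 9 renamed to cudnn<major>-cuda-<cuda major>.
cudnn_pkg() {
  local cudnn_major="$1" cuda_major="$2"
  if [ "$cudnn_major" -ge 9 ]; then
    echo "cudnn${cudnn_major}-cuda-${cuda_major}"
  else
    echo "libcudnn${cudnn_major}"
  fi
}
```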

bdice (Contributor, Author) left a review comment:

@trxcllnt I had one question.

(review thread on features/src/cuda/install.sh, resolved)
jakirkham (Member) commented:

Do we need to install cxx-compiler somewhere and point CMake to it?

Seeing this on CI:

CMake Error at /usr/share/cmake-3.30/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
  Could not find compiler set in environment variable CXX:

  /usr/bin/g++.

Call Stack (most recent call first):
  CMakeLists.txt:24 (project)


CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!

trxcllnt (Collaborator) commented:

No, the problem is there are no matrix entries for CUDA 12.5 in dependencies.yaml (e.g. here), causing rapids-dependency-file-generator to output an empty conda environment yaml file, so nothing gets installed.
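For reference, the kind of entry rapids-dependency-file-generator consumes resembles the following; the structure follows the RAPIDS dependencies.yaml convention, but the exact key names and packages here are illustrative:

```yaml
dependencies:
  cuda_version:
    specific:
      - output_types: conda
        matrices:
          - matrix:
              cuda: "12.5"
            packages:
              - cuda-version=12.5
```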

trxcllnt merged commit 1bd1bd5 into rapidsai:branch-24.08 on Jul 17, 2024
211 of 212 checks passed