Update Databricks multi-node to CUDA 12 #410

Open
jacobtomlinson opened this issue Aug 8, 2024 · 0 comments
Labels: bug (Something isn't working), platform/databricks

Comments

@jacobtomlinson (Member)

Our current docs for multi-node Databricks cover the following process:

  • Create a startup script that installs RAPIDS and dask-databricks, then runs dask-databricks
  • Create an MNMG cluster that uses the 14.2 (Scala 2.12, Spark 3.5.0) runtime
  • Select *Use your own Docker container* and enter the image databricksruntime/gpu-tensorflow:cuda11.8 or databricksruntime/gpu-pytorch:cuda11.8
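The startup script in the first step can be sketched roughly as below. This is a minimal, untested sketch: the NVIDIA package index URL, the exact set of RAPIDS packages, and the `dask databricks run` entry point are assumptions based on typical dask-databricks usage, not taken from this issue.

```shell
# Hypothetical Databricks cluster init script (written to a file here for illustration).
cat > /tmp/init-dask-databricks.sh <<'EOF'
#!/bin/bash
set -euo pipefail

# Install RAPIDS (CUDA 11 wheels, assumed package names) and dask-databricks
# into the runtime's Python environment.
pip install --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11 dask-cudf-cu11 dask-cuda dask-databricks

# Start Dask; dask-databricks decides whether this node acts as the
# scheduler (driver) or a worker from the Databricks environment.
dask databricks run
EOF
chmod +x /tmp/init-dask-databricks.sh
echo "wrote /tmp/init-dask-databricks.sh"
```

On a real cluster the script would be uploaded as a cluster-scoped init script rather than written to `/tmp`.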

The container images use CUDA 11.8 and there are no CUDA 12 images available from Databricks.

The single-node instructions don't use a custom container at all, so in theory we should be able to do the same with the multi-node instructions.

In practice, if you omit the custom container the init script fails. The logs show that NVML can't be found during Dask startup. This makes me think that either the NVIDIA driver or the CUDA toolkit is not yet installed at the time the init script runs and gets installed later.
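One way to test that hypothesis would be to have the init script poll until NVML is loadable before launching Dask. A hypothetical sketch follows; using `nvidia-smi` as the NVML probe and the 5-minute timeout are assumptions:

```shell
# Hypothetical wait-for-driver helper (written to a file here for illustration).
cat > /tmp/wait-for-nvml.sh <<'EOF'
#!/bin/bash
# Poll for up to 5 minutes (60 tries x 5 s) until nvidia-smi, which loads
# NVML, runs successfully.
for i in $(seq 1 60); do
    if nvidia-smi >/dev/null 2>&1; then
        echo "NVIDIA driver is up"
        exit 0
    fi
    sleep 5
done
echo "Timed out waiting for the NVIDIA driver" >&2
exit 1
EOF
chmod +x /tmp/wait-for-nvml.sh
echo "wrote /tmp/wait-for-nvml.sh"
```

If the driver never appears within the timeout, that would instead point to the driver only being installed after all init scripts finish, in which case polling alone would not be enough.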

We should find a way to start up dask-databricks without using a custom container and update the documentation.

@jacobtomlinson jacobtomlinson added bug Something isn't working platform/databricks labels Aug 8, 2024