Update Databricks multi-node to CUDA 12 #410

Open
jacobtomlinson opened this issue Aug 8, 2024 · 0 comments
Labels: bug (Something isn't working), platform/databricks

Comments

@jacobtomlinson (Member)

Our current docs for multi-node Databricks cover the following process:

  • Create a startup script that installs RAPIDS and dask-databricks, then runs dask-databricks
  • Create an MNMG cluster that uses the 14.2 (Scala 2.12, Spark 3.5.0) runtime
  • Select *Use your own Docker container* and enter the image databricksruntime/gpu-tensorflow:cuda11.8 or databricksruntime/gpu-pytorch:cuda11.8
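The startup script in the first step can be sketched roughly as below. This is a minimal, untested sketch: the NVIDIA package index URL, the exact set of RAPIDS packages, and the `dask databricks run` entry point are assumptions based on typical dask-databricks usage, not taken from this issue.

```shell
# Hypothetical Databricks cluster init script (written to a file here for illustration).
cat > /tmp/init-dask-databricks.sh <<'EOF'
#!/bin/bash
set -euo pipefail

# Install RAPIDS (CUDA 11 wheels, assumed package names) and dask-databricks
# into the runtime's Python environment.
pip install --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11 dask-cudf-cu11 dask-cuda dask-databricks

# Start Dask; dask-databricks decides whether this node acts as the
# scheduler (driver) or a worker from the Databricks environment.
dask databricks run
EOF
chmod +x /tmp/init-dask-databricks.sh
echo "wrote /tmp/init-dask-databricks.sh"
```

On a real cluster the script would be uploaded as a cluster-scoped init script rather than written to `/tmp`.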

The container images use CUDA 11.8 and there are no CUDA 12 images available from Databricks.

The single-node instructions don't use a custom container at all, so in theory we should be able to do the same with the multi-node instructions.

In practice, if you omit the custom container the init script fails. The logs show that NVML can't be found during Dask startup. This makes me think that either the NVIDIA driver or the CUDA toolkit is not yet installed at the time the init script runs and gets installed later.
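One way to test that hypothesis would be to have the init script poll until NVML is loadable before launching Dask. A hypothetical sketch follows; using `nvidia-smi` as the NVML probe and the 5-minute timeout are assumptions:

```shell
# Hypothetical wait-for-driver helper (written to a file here for illustration).
cat > /tmp/wait-for-nvml.sh <<'EOF'
#!/bin/bash
# Poll for up to 5 minutes (60 tries x 5 s) until nvidia-smi, which loads
# NVML, runs successfully.
for i in $(seq 1 60); do
    if nvidia-smi >/dev/null 2>&1; then
        echo "NVIDIA driver is up"
        exit 0
    fi
    sleep 5
done
echo "Timed out waiting for the NVIDIA driver" >&2
exit 1
EOF
chmod +x /tmp/wait-for-nvml.sh
echo "wrote /tmp/wait-for-nvml.sh"
```

If the driver never appears within the timeout, that would instead point to the driver only being installed after all init scripts finish, in which case polling alone would not be enough.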

We should find a way to start up dask-databricks without using a custom container and update the documentation.

@jacobtomlinson jacobtomlinson added bug Something isn't working platform/databricks labels Aug 8, 2024