Dask cluster attachment fails when running TPCx-BB, query 07, due to receiving UUID (string) instead of GPU sequence number (int) #435
Comments
Shortest reproducer:
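A minimal sketch of the failing case, reconstructed from the worker traceback at the bottom of this issue (an assumption, not necessarily the exact original snippet; the UUID value is taken from the error message):

# Hedged reconstruction: with CUDA_VISIBLE_DEVICES set to a GPU UUID
# (as Slurm does on this machine), dask-cuda coerces each entry to int
# and fails with the ValueError shown in the worker traceback below.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "GPU-345e089a-cafa-621e-f2b5-f6f39217baef"

from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()  # ValueError: invalid literal for int() with base 10: 'GPU-...'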
Explicitly setting CUDA_VISIBLE_DEVICES to the acquired GPU sequence number avoids the error:

In [1]: from dask_cuda import LocalCUDACluster

In [2]: …
cc @quasiben, have we ever seen a situation in which leaving CUDA_VISIBLE_DEVICES unset has caused the cuda_visible_devices function to pick up UUIDs instead of sequence IDs?
I don't think we've seen this before. @rfvander, can you confirm you don't have CUDA_VISIBLE_DEVICES set? Besides that, could you paste the output of …
Could you also please report the output of running the following in the same environment:
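Presumably a pynvml check along these lines (a hypothetical reconstruction, since the exact command is not shown; a missing pynvml would fail here with ModuleNotFoundError, matching the reply below):

import pynvml

# Hypothetical pynvml check: initialize NVML and list GPU indices and UUIDs.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    print(i, pynvml.nvmlDeviceGetUUID(handle))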
I do not set CUDA_VISIBLE_DEVICES myself, but it has a value as soon as I acquire a cluster node in computelab (see below). I am guessing it is part of the Slurm preamble and am inquiring now. The Python command fails due to a missing module: …
This looks like the environment is not set up correctly. Can you post the contents of the conda environment?
So in your case it seems that Slurm is setting that variable to the GPU UUID. We don't currently support that -- and TBH, I didn't even know this was allowed -- so we'll have to work on a fix. It's also a problem that you can't import pynvml, because it is a requirement for newer versions of dask-cuda. Could you confirm what version of dask-cuda you have installed? Anything after 0.14 (and I believe even earlier, though I don't recall exactly since when) requires pynvml. One other thing that may be happening is that you're not activating your environment before running that command, or that you're running …
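For illustration, a minimal sketch of what UUID-tolerant parsing of CUDA_VISIBLE_DEVICES could look like (an assumption about the shape of the fix, not the actual dask-cuda patch; parse_cuda_visible_devices is a hypothetical name):

def parse_cuda_visible_devices(value):
    """Split CUDA_VISIBLE_DEVICES, keeping plain indices as ints and
    GPU-<uuid> entries as strings instead of failing on int()."""
    devices = []
    for item in value.split(","):
        item = item.strip()
        try:
            devices.append(int(item))  # sequence number, e.g. "0"
        except ValueError:
            devices.append(item)  # UUID form, e.g. "GPU-345e089a-..."
    return devices

# Both forms now parse instead of raising ValueError:
print(parse_cuda_visible_devices("0,1"))
print(parse_cuda_visible_devices("GPU-345e089a-cafa-621e-f2b5-f6f39217baef"))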
I can easily unset CUDA_VISIBLE_DEVICES on my end. |
That will be the short-term solution for your case. I'll fix dask-cuda to accept UUIDs too, instead of only allowing integers.
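The short-term workaround amounts to clearing the variable before starting the cluster, e.g. in Python (the shell equivalent is unset CUDA_VISIBLE_DEVICES):

import os

# Drop the Slurm-provided UUID value so dask-cuda enumerates GPUs itself.
os.environ.pop("CUDA_VISIBLE_DEVICES", None)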
Inside the activated conda environment: |
With CUDA_VISIBLE_DEVICES unset I don't get the previous error, but the code still hangs in the Dask cluster-attach call, so I Ctrl-C out of it:
What happens if you …
Hi Peter, I've already done all that, and the result is shown in my previous comments. |
It seems that we're then in TPCx-BB territory, and I'm not very familiar with its current internals. Can you paste the contents of your benchmark_config.yaml?
It also seems that you're running with the TCP protocol and not UCX, but it's getting stuck in:

File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/xbb_tools/cluster_startup.py", line 93, in attach_to_cluster
    ucx_config = client.submit(_get_ucx_config).result()

@beckernick do you know if that succeeds with TCP? I haven't tested it myself, so I'm not sure whether this should be executed when …
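For context on why that line can hang: a task submitted to a scheduler with no connected workers is queued but never scheduled, so .result() blocks indefinitely. A minimal sketch (the scheduler address is taken from the logs below; the timeout is illustrative, not part of xbb_tools):

from distributed import Client

client = Client("tcp://10.31.37.138:8786")  # connecting to the scheduler succeeds
future = client.submit(lambda: "ok")  # queued, but never runs without workers
print(future.result(timeout=10))  # raises TimeoutError instead of blocking forever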
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ cat ../../benchmark_runner/benchmark_config.yaml

# benchmark config yaml
# Please fill these accordingly
data_dir: /home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/sf100/parquet-1/
cluster_host: 10.31.37.138
verify_results: False
sheet: TPCx-BB
Taking a more thorough look at all the logs above, it seems the client doesn't succeed because there are no workers connected to the scheduler. Can you confirm how you're running …?
Yes, the connection is never made: as I reported earlier, the second print statement is never reached.
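A quick way to confirm the no-workers diagnosis against the same scheduler (illustrative snippet):

from distributed import Client

client = Client("tcp://10.31.37.138:8786")
# scheduler_info() reports connected workers; an empty mapping confirms the
# dask-cuda-worker processes crashed before registering with the scheduler.
print(client.scheduler_info()["workers"])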
The remaining issue was related to the cluster setup configuration; it was fixed by @beckernick.
I just opened #437 to resolve the UUID issue. |
When I start a Dask cluster with TCP (tpcx_bb/cluster_configuration$ bash cluster-startup.sh TCP) as part of running the TPCx-BB benchmark, query 07, the Python script fails immediately in the attach_to_cluster call. This is my marked-up main program (I added two print statements; the first is reached, the second is not, because the code times out):
if __name__ == "__main__":
    from xbb_tools.cluster_startup import attach_to_cluster

    import cudf
    import dask_cudf
    # … (attach_to_cluster call and the two added print statements follow)
Output:
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ python tpcx_bb_query_07.py --config_file=../../benchmark_runner/benchmark_config.yaml
Using default arguments
About to attach to cluster
Connected to tcp://10.31.37.138:8786
^C^CTraceback (most recent call last):
File "tpcx_bb_query_07.py", line 163, in
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/xbb_tools/cluster_startup.py", line 93, in attach_to_cluster
ucx_config = client.submit(_get_ucx_config).result()
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/client.py", line 222, in result
result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/client.py", line 833, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/utils.py", line 337, in sync
e.wait(10)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
Scheduler and worker logs:
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.31.37.138:8786
distributed.scheduler - INFO - dashboard at: 10.31.37.138:8787
distributed.scheduler - INFO - Receive client connection: Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove client Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
distributed.scheduler - INFO - Remove client Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
distributed.scheduler - INFO - Close client connection: Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ cat /tmp/robv/tpcx-bb-dask-logs/worker.log
Traceback (most recent call last):
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/bin/dask-cuda-worker", line 33, in
sys.exit(load_entry_point('dask-cuda==0.16.0a201015', 'console_scripts', 'dask-cuda-worker')())
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cli/dask_cuda_worker.py", line 279, in go
main()
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cli/dask_cuda_worker.py", line 254, in main
**kwargs,
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cuda_worker.py", line 226, in init
for i in range(nprocs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cuda_worker.py", line 226, in
for i in range(nprocs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/local_cuda_cluster.py", line 38, in cuda_visible_devices
visible = list(visible)
ValueError: invalid literal for int() with base 10: 'GPU-345e089a-cafa-621e-f2b5-f6f39217baef'