Dask cluster attachment fails when running TPCx-BB, query 07, due to receiving UUID (string) instead of GPU sequence number (int) #435

Closed
rfvander opened this issue Nov 2, 2020 · 22 comments

Comments

@rfvander

rfvander commented Nov 2, 2020

When I start a Dask cluster with TCP (tpcx_bb/cluster_configuration$ bash cluster-startup.sh TCP) as part of running the TPCx-BB benchmark, query 07, the Python script fails immediately in the attach_to_cluster call. This is my marked-up main program, to which I added two print statements. The first is reached, the second is not, because the code times out:
if __name__ == "__main__":
    from xbb_tools.cluster_startup import attach_to_cluster
    import cudf
    import dask_cudf

    config = tpcxbb_argparser()
    print('About to attach to cluster')
    client, bc = attach_to_cluster(config)
    print('Attached to cluster')
    run_query(config=config, client=client, query_func=main)

Output:
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ python tpcx_bb_query_07.py --config_file=../../benchmark_runner/benchmark_config.yaml
Using default arguments
About to attach to cluster
Connected to tcp://10.31.37.138:8786
^C^CTraceback (most recent call last):
File "tpcx_bb_query_07.py", line 163, in
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/xbb_tools/cluster_startup.py", line 93, in attach_to_cluster
ucx_config = client.submit(_get_ucx_config).result()
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/client.py", line 222, in result
result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/client.py", line 833, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/utils.py", line 337, in sync
e.wait(10)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

Scheduler and worker logs:
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.31.37.138:8786
distributed.scheduler - INFO - dashboard at: 10.31.37.138:8787
distributed.scheduler - INFO - Receive client connection: Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove client Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
distributed.scheduler - INFO - Remove client Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
distributed.scheduler - INFO - Close client connection: Client-39495dd8-1d28-11eb-8d0a-0cc47ab493b2
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ cat /tmp/robv/tpcx-bb-dask-logs/worker.log
Traceback (most recent call last):
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/bin/dask-cuda-worker", line 33, in
sys.exit(load_entry_point('dask-cuda==0.16.0a201015', 'console_scripts', 'dask-cuda-worker')())
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cli/dask_cuda_worker.py", line 279, in go
main()
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cli/dask_cuda_worker.py", line 254, in main
**kwargs,
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cuda_worker.py", line 226, in init
for i in range(nprocs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cuda_worker.py", line 226, in
for i in range(nprocs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/local_cuda_cluster.py", line 38, in cuda_visible_devices
visible = list(visible)
ValueError: invalid literal for int() with base 10: 'GPU-345e089a-cafa-621e-f2b5-f6f39217baef'

@rfvander
Author

rfvander commented Nov 2, 2020

Shortest reproducer:
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ ipython
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from dask_cuda import LocalCUDACluster
...: cluster = LocalCUDACluster()

ValueError Traceback (most recent call last)
in
1 from dask_cuda import LocalCUDACluster
----> 2 cluster = LocalCUDACluster()
/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, n_workers, threads_per_worker, processes, memory_limit, device_memory_limit, CUDA_VISIBLE_DEVICES, data, local_directory, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, ucx_net_devices, rmm_pool_size, rmm_managed_memory, **kwargs)
157
158 if CUDA_VISIBLE_DEVICES is None:
--> 159 CUDA_VISIBLE_DEVICES = cuda_visible_devices(0)
160 if isinstance(CUDA_VISIBLE_DEVICES, str):
161 CUDA_VISIBLE_DEVICES = CUDA_VISIBLE_DEVICES.split(",")
/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/local_cuda_cluster.py in cuda_visible_devices(i, visible)
36 except KeyError:
37 visible = range(get_n_gpus())
---> 38 visible = list(visible)
39
40 L = visible[i:] + visible[:i]
ValueError: invalid literal for int() with base 10: 'GPU-345e089a-cafa-621e-f2b5-f6f39217baef'
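
For context, a minimal sketch of the failure mode. The exact parsing inside dask_cuda's cuda_visible_devices is an assumption here, but the traceback shows the env-var entries being converted with int(), which cannot parse a GPU UUID:

import os

# Hypothetical reproduction of the parsing that fails inside
# dask_cuda.local_cuda_cluster.cuda_visible_devices (exact implementation may
# differ): entries are assumed to be integer indices, so int() rejects a UUID.
os.environ["CUDA_VISIBLE_DEVICES"] = "GPU-345e089a-cafa-621e-f2b5-f6f39217baef"

try:
    visible = [int(d) for d in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: 'GPU-345e089a-...'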

@rfvander
Author

rfvander commented Nov 2, 2020

Explicitly setting CUDA_VISIBLE_DEVICES to the acquired GPU sequence number avoids the error:
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ export CUDA_VISIBLE_DEVICES=0
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ ipython
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from dask_cuda import LocalCUDACluster
...: cluster = LocalCUDACluster()

In [2]:

@beckernick
Member

cc @quasiben , have we ever seen a situation in which leaving CUDA_VISIBLE_DEVICES unset has caused the cuda_visible_devices function to pick up the UUIDs instead of the sequence ID?

@pentschev
Member

I don't think we've seen this before. @rfvander, can you confirm that CUDA_VISIBLE_DEVICES is not set in your environment when you don't do the export explicitly, for example by checking the output of set | grep CUDA_VISIBLE_DEVICES?

Besides that, could you paste the output of conda list and tell us which CUDA toolkit and driver versions you're using?

@pentschev
Member

Could you also please report the output of running the following on the same environment:

python -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"

@rfvander
Author

rfvander commented Nov 2, 2020

I do not set CUDA_VISIBLE_DEVICES myself, but it has a value as soon as I acquire a cluster node in computelab; see below. I am guessing that it is set as part of the slurm preamble and am inquiring now.
robv@computelab-138:/home/scratch.robv_gpu_1/apps/florent_p4/StacA2$ set | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=GPU-345e089a-cafa-621e-f2b5-f6f39217baef

The python command fails due to a missing module:
robv@computelab-138:/home/scratch.robv_gpu_1/apps/florent_p4/StacA2$ python -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named pynvml

@quasiben
Member

quasiben commented Nov 2, 2020

ImportError: No module named pynvml

This looks like the env is not set up correctly. Can you post the contents of the conda environment (conda list)?

@pentschev
Member

So in your case it seems like slurm is setting that to the GPU UUID. We don't currently support that -- and TBH, I didn't even know this was allowed. We'll have to work on a fix for it.

It's also a problem that you can't import pynvml, because it is a requirement for newer versions of dask-cuda. Could you confirm what version of dask-cuda you have installed? Anything after 0.14 requires pynvml (and I believe even earlier versions do too, but I don't recall exactly since when). One other thing that may be happening is that you're not activating your environment before running that command; if you're running python with its full path to pick the correct environment, you may need to do the same before running it.

@rfvander
Author

rfvander commented Nov 2, 2020

I can easily unset CUDA_VISIBLE_DEVICES on my end.
Indeed, I ran the python command "raw," before activating any environment.

@pentschev
Member

I can easily unset CUDA_VISIBLE_DEVICES on my end.

That will be the short-term solution for your case. I'll fix dask-cuda to accept that too, instead of only allowing integers.
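
For illustration only, one way such a fix could normalize the variable, using pynvml to map UUIDs back to indices. This is a sketch under that assumption, not the change that eventually landed in dask-cuda:

import pynvml

def normalize_cuda_visible_devices(value):
    # Sketch only: map CUDA_VISIBLE_DEVICES entries, either integer indices or
    # "GPU-<uuid>" strings, to integer device indices.
    pynvml.nvmlInit()
    uuid_to_index = {}
    for i in range(pynvml.nvmlDeviceGetCount()):
        uuid = pynvml.nvmlDeviceGetUUID(pynvml.nvmlDeviceGetHandleByIndex(i))
        if isinstance(uuid, bytes):  # older pynvml versions return bytes
            uuid = uuid.decode()
        uuid_to_index[uuid] = i
    indices = []
    for entry in value.split(","):
        entry = entry.strip()
        indices.append(uuid_to_index[entry] if entry.startswith("GPU-") else int(entry))
    return indices

# e.g. normalize_cuda_visible_devices("GPU-345e089a-cafa-621e-f2b5-f6f39217baef") -> [0]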

@rfvander
Author

rfvander commented Nov 2, 2020

Inside the activated conda environment:
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb$ python -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"
1

@rfvander
Author

rfvander commented Nov 2, 2020

With CUDA_VISIBLE_DEVICES unset, I don't get the previous error, but the code still hangs in the Dask cluster attach call, so I Ctrl-C out of it:
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ python tpcx_bb_query_07.py --config_file=../../benchmark_runner/benchmark_config.yaml
Using default arguments
About to attach to cluster
Connected to tcp://10.31.37.138:8786
^CTraceback (most recent call last):
File "tpcx_bb_query_07.py", line 163, in
client, bc = attach_to_cluster(config)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/xbb_tools/cluster_startup.py", line 93, in attach_to_cluster
ucx_config = client.submit(_get_ucx_config).result()
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/client.py", line 222, in result
result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/client.py", line 833, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/utils.py", line 337, in sync
e.wait(10)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
^C
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ cat /tmp/robv/tpcx-bb-dask-logs/worker.log
Traceback (most recent call last):
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/bin/dask-cuda-worker", line 33, in
sys.exit(load_entry_point('dask-cuda==0.16.0a201015', 'console_scripts', 'dask-cuda-worker')())
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cli/dask_cuda_worker.py", line 279, in go
main()
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cli/dask_cuda_worker.py", line 254, in main
**kwargs,
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cuda_worker.py", line 226, in init
for i in range(nprocs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/dask_cuda/cuda_worker.py", line 226, in
for i in range(nprocs)
File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/nanny.py", line 109, in init
self.scheduler_addr = cfg["address"]
TypeError: 'NoneType' object is not subscriptable
(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ cat /tmp/robv/tpcx-bb-dask-logs/scheduler.log
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.31.37.138:8786
distributed.scheduler - INFO - dashboard at: 10.31.37.138:8787
distributed.scheduler - INFO - Receive client connection: Client-aafc1c94-1d33-11eb-9bfc-0cc47ab493b2
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove client Client-aafc1c94-1d33-11eb-9bfc-0cc47ab493b2
distributed.scheduler - INFO - Remove client Client-aafc1c94-1d33-11eb-9bfc-0cc47ab493b2
distributed.scheduler - INFO - Close client connection: Client-aafc1c94-1d33-11eb-9bfc-0cc47ab493b2

@pentschev
Member

What happens if you unset CUDA_VISIBLE_DEVICES and then run python -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"? Given that slurm is presetting CUDA_VISIBLE_DEVICES, I'm thinking that unsetting it now makes dask-cuda see all devices but perhaps get stuck on a permission or similar issue. For this test, could you also check which numerical index corresponds to the device that gets set automatically in CUDA_VISIBLE_DEVICES and replace the UUID with that index, to verify whether we can get past this point?
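
A hypothetical test sequence combining the suggestions above (assuming pynvml is available in the activated environment; using the LocalCUDACluster keyword from the reproducer rather than the env var):

import os
os.environ.pop("CUDA_VISIBLE_DEVICES", None)  # clear the UUID that slurm set

import pynvml
pynvml.nvmlInit()
print("pynvml device count:", pynvml.nvmlDeviceGetCount())

from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0")  # numeric index instead of the UUID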

@rfvander
Author

rfvander commented Nov 2, 2020

Hi Peter, I've already done all that, and the result is shown in my previous comments.

@pentschev
Member

It seems that we're now in TPCx-BB territory, and I'm not very familiar with its current internals. Can you paste the contents of your benchmark_config.yaml? Besides that, @beckernick, should one be able to run queries (specifically q07 here) with a single GPU?

@pentschev
Member

It also seems that you're running with the TCP protocol and not UCX, but it's getting stuck in

File "/home/scratch.robv_gpu_1/apps/rapids/notebooks/xgboost/dask-xgboost/ls/envs/rapids-tpcx-bb/lib/python3.7/site-packages/xbb_tools/cluster_startup.py", line 93, in attach_to_cluster
ucx_config = client.submit(_get_ucx_config).result()

@beckernick, do you know if that succeeds with TCP? I haven't tested it myself, so I'm not sure whether this should be executed when protocol="tcp".

@rfvander
Author

rfvander commented Nov 2, 2020

(rapids-tpcx-bb) robv@computelab-138:/home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/tpcx_bb/queries/q07$ cat ../../benchmark_runner/benchmark_config.yaml

# benchmark config yaml
# Please fill these accordingly

data_dir: /home/scratch.robv_gpu_1/apps/rapids/tpcx-bb/sf100/parquet-1/
output_dir: ./
file_format: parquet
output_filetype: parquet
split_row_groups: False
repartition_small_table: True
benchmark_runner_include_bsql: False

cluster_host: 10.31.37.138
cluster_port: 8786
dask_profile: False
dask_dir: ./
32GB_workers: 0

verify_results: False
verify_dir:

sheet: TPCx-BB
tab: SF1000 Benchmarking Matrix

@pentschev
Member

Taking a more thorough look at all the logs above, it seems the client doesn't succeed because there are no workers connected to the scheduler. Can you confirm how you're running dask-cuda-worker? The scheduler logs don't show any workers registering, so there is an issue before you even launch your query.
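
A quick way to confirm this from the client side, a sketch using the standard distributed API with the scheduler address taken from the config above:

from distributed import Client

# Connect to the scheduler from benchmark_config.yaml and list the registered
# workers; an empty mapping means no dask-cuda-worker ever connected.
client = Client("tcp://10.31.37.138:8786", timeout=10)
workers = client.scheduler_info()["workers"]
print(len(workers), "worker(s) registered:", list(workers))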

@beckernick
Member

_get_ucx_config will run fine with TCP. It's hard to know specifically what's going on without more visibility into what the workers are doing. It's likely there is something going on specific to this cluster configuration / IP setup. Let's take this discussion to Slack.

@rfvander
Author

rfvander commented Nov 2, 2020

Yes, the connection is never made, evidenced by the fact that the second print statement is not reached, as I reported earlier.

@rfvander rfvander closed this as completed Nov 2, 2020
@rfvander
Author

rfvander commented Nov 2, 2020

The remaining issue was related to the cluster setup configuration and was fixed by @beckernick.

@pentschev
Member

I just opened #437 to resolve the UUID issue.
