
multi-gpu dbscan Segmentation fault #5961

Open
Cocoaxx opened this issue Jul 10, 2024 · 12 comments
Labels: ? - Needs Triage, bug

Comments

@Cocoaxx

Cocoaxx commented Jul 10, 2024

Describe the bug
When I try to use multi-GPU DBSCAN, I get a segmentation fault (Segmentation fault: invalid permissions for mapped object at address 0x7f0c8e0007c0).
[screenshot: traceback showing the segmentation fault]

Steps/Code to reproduce bug
[screenshot: reproduction script]
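
For reference, a minimal sketch of the kind of multi-GPU DBSCAN script being reproduced (the original code was attached only as a screenshot; the dataset shape, parameters, and the use of protocol="ucx" below are assumptions based on the UCX warnings in the trace):

```python
# Assumed reproduction sketch, not the exact code from the screenshot.
import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.cluster import DBSCAN

if __name__ == "__main__":
    # One Dask worker per GPU; the UCX warnings in the trace suggest protocol="ucx" was used.
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)

    # Synthetic data; shape and dtype are placeholders.
    gdf = cudf.DataFrame(
        {f"f{i}": np.random.random(100_000).astype("float32") for i in range(16)}
    )
    ddf = dask_cudf.from_cudf(gdf, npartitions=2)

    dbscan = DBSCAN(client=client, eps=0.5, min_samples=5)
    labels = dbscan.fit_predict(ddf)
    print(labels.compute())

    client.close()
    cluster.close()
```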

Environment details (please complete the following information):

  • Linux Distro/Architecture: [CentOS 7.2]
  • GPU Model/Driver: [V100 and driver 470.239.06]
  • CUDA: [11.8]
  • Method of cuDF & cuML install: [pip]
Package Version

-arkupSafe 2.0.1
-enus-api-base 1.3.14
aiohttp 3.9.5
aiosignal 1.3.1
anyio 3.7.1
argon2-cffi 20.1.0
asgiref 3.8.1
async-generator 1.10
async-timeout 4.0.3
attrs 20.3.0
autovizwidget 0.21.0
Babel 2.9.0
backcall 0.2.0
bleach 3.3.0
cachetools 5.3.3
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
click 8.1.7
cloudpickle 3.0.0
comm 0.2.2
contourpy 1.1.1
cos-python-sdk-v5 1.9.30
coscmd 1.8.5.37
crcmod 1.7
cubinlinker-cu11 0.3.0.post1
cuda-python 11.8.2
cudf-cu11 23.4.1
cugraph-cu11 23.4.1
cuml-cu11 23.4.1
cupy-cuda11x 11.6.0
cycler 0.12.1
Cython 3.0.8
dask 2023.3.2
dask-cuda 23.4.0
dask-cudf-cu11 23.4.1
DateTime 5.5
decorator 5.0.7
defusedxml 0.7.1
deprecation 2.1.0
dill 0.3.8
distributed 2023.3.2.1
dulwich 0.22.1
entrypoints 0.3
exceptiongroup 1.2.1
fastapi 0.70.0
fastrlock 0.8.2
filelock 3.9.0
fonttools 4.49.0
frozenlist 1.4.1
fsspec 2024.2.0
fuzzywuzzy 0.18.0
h11 0.14.0
hdijupyterutils 0.19.1
huggingface-hub 0.21.3
idna 2.10
importlib-metadata 7.0.1
importlib-resources 6.1.2
ipykernel 5.5.3
ipython 7.22.0
ipython-genutils 0.2.0
ipywidgets 8.1.3
jedi 0.18.0
Jinja2 2.11.3
joblib 1.3.2
json5 0.9.5
jsonschema 3.2.0
jupyter-client 6.2.0
jupyter-core 4.7.1
jupyter-packaging 0.9.1
jupyter-server 1.6.1
jupyterlab 3.0.14
jupyterlab-pygments 0.1.2
jupyterlab-server 2.4.0
jupyterlab-widgets 3.0.11
kiwisolver 1.4.5
llvmlite 0.39.1
locket 1.0.0
markdown-it-py 3.0.0
MarkupSafe 2.0.1
matplotlib 3.7.5
mdurl 0.1.2
mistune 0.8.4
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
nbclassic 0.2.7
nbclient 0.5.3
nbconvert 6.0.7
nbformat 5.1.3
nest-asyncio 1.5.1
networkx 3.0
nltk 3.8.1
notebook 6.3.0
numba 0.56.4
numpy 1.23.5
nvtx 0.2.10
packaging 20.9
pandas 1.5.3
pandocfilters 1.4.3
parso 0.8.2
partd 1.4.1
pexpect 4.8.0
pickleshare 0.7.5
pillow 10.2.0
pip 21.0.1
pip-magic 0.2.3
plotly 5.22.0
prettytable 3.10.0
prometheus-client 0.10.1
prompt-toolkit 3.0.18
protobuf 4.21.12
psutil 5.9.8
ptxcompiler-cu11 0.7.0.post1
ptyprocess 0.7.0
pyarrow 10.0.1
pycparser 2.20
pycryptodome 3.20.0
pydantic 1.10.17
pygments 2.18.0
pylibcugraph-cu11 23.4.1
pylibraft-cu11 23.4.1
pynvml 11.4.1
pyparsing 2.4.7
pyrsistent 0.17.3
python-dateutil 2.8.1
pytz 2021.1
PyYAML 6.0.1
pyzmq 22.0.3
raft-dask-cu11 23.4.1
regex 2023.12.25
requests 2.25.1
rich 13.7.1
rmm-cu11 23.4.1
safetensors 0.4.2
scikit-learn 1.3.2
scipy 1.10.1
seaborn 0.13.2
Send2Trash 1.5.0
sentence-transformers 2.2.2
sentencepiece 0.2.0
setuptools 52.0.0.post20210125
six 1.15.0
sniffio 1.2.0
sortedcontainers 2.4.0
sparkmagic 0.19.1.12
starlette 0.16.0
supervisor 4.2.5
sympy 1.12
tblib 3.0.0
tenacity 8.5.0
terminado 0.9.4
testpath 0.4.4
threadpoolctl 3.3.0
tokenizers 0.15.2
tomlkit 0.7.0
toolz 0.12.1
torch 2.1.0+cu118
torchaudio 2.1.0+cu118
torchvision 0.16.0+cu118
tornado 6.1
tqdm 4.66.2
traitlets 5.0.5
transformers 4.38.2
treelite 3.2.0
treelite-runtime 3.2.0
triton 2.1.0
typing-extensions 4.10.0
ucx-py-cu11 0.31.1
urllib3 1.26.4
urwid 2.6.15
uvicorn 0.15.0
venus-api-all 1.3.21
venus-api-base 1.3.21
venus-boot 1.3.21
venus-extension 0.8.1
venus-flow 1.3.21
venus-flow-operator 1.3.21
venus-mdfs 0.1.0
venus-ml 1.3.21
venus-sdk 1.3.21
venus-tools 1.3.21
wcwidth 0.2.5
webencodings 0.5.1
wheel 0.36.2
widgetsnbextension 4.0.11
xmltodict 0.13.0
yarl 1.9.4
zict 3.0.0
zipp 3.17.0
zope.interface 6.4.post2

@Cocoaxx added the "? - Needs Triage" and "bug" labels on Jul 10, 2024
@dantegd
Member

dantegd commented Jul 10, 2024

Thanks for the issue @Cocoaxx, the permissions issue makes me believe this might have to do with UCX in the system, with the first warning in the trace being suspicious. Maybe someone like @pentschev would know if I'm looking in the correct place to triage this issue.

@pentschev
Member

The warning saying transports 'cuda_copy', ... are not available means UCX wasn't compiled with CUDA support. I also notice you have ucx-py-cu11 installed but no libucx-cu11, which suggests you probably installed UCX and UCX-Py from source. If that's the case, I would suggest relying on libucx-cu11 instead of a system UCX install; if that's not possible, you would have to recompile UCX with --with-cuda=$CUDA_HOME, where CUDA_HOME points to the system's CUDA installation, generally /usr/local/cuda but possibly elsewhere on your system.
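
A hedged way to double-check this from the affected environment (assuming the ucx_info utility from the same UCX build is on PATH):

```python
# Diagnostic sketch: verify which UCX build UCX-Py sees and whether it exposes
# CUDA transports (a CUDA-enabled build should list cuda_copy / cuda_ipc).
import subprocess
import ucp  # ucx-py

print("UCX version seen by UCX-Py:", ucp.get_ucx_version())

# `ucx_info -d` lists the transports/devices available to the UCX build on PATH.
out = subprocess.run(["ucx_info", "-d"], capture_output=True, text=True).stdout
cuda_lines = [line for line in out.splitlines() if "cuda" in line.lower()]
print("\n".join(cuda_lines) if cuda_lines else "No CUDA transports reported.")
```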

@Cocoaxx
Author

Cocoaxx commented Jul 11, 2024

Thank you for your quick reply; I will give it a try. By the way, I have to use Python 3.8 and RAPIDS 23.04, but I find that ucx-py requires Python >= 3.9?
My Dockerfile install instruction looks like this:
RUN source ~/.bashrc \
    && conda deactivate && conda activate env-3.8.8 \
    && pip install protobuf==3.20.1 \
    && pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com \
    && pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com \
    && pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com \
    && pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118 \
    && pip install MarkupSafe==2.0.1 \
    && pip install scikit-learn \
    && pip install transformers==4.38.2 \
    && pip install sentence-transformers==2.2.2 \
    && wget -P /tmp $GENERIC_REPO_URL/cpu/clean-layer.sh \
    && sh /tmp/clean-layer.sh \
    && cd /data/ \
    && ln -s miniconda3/envs/env-3.8.8 anaconda3

@Cocoaxx
Author

Cocoaxx commented Jul 11, 2024

Now I have upgraded Python to 3.9 and still encounter this problem. I tried to install libucx_cu11 from a wheel file, but I got ERROR: libucx_cu11-1.16.0.post1-py3-none-manylinux_2_28_x86_64.whl is not a supported wheel on this platform.
Then I tried to install UCX from source following the doc https://ucx-py.readthedocs.io/en/latest/install.html#source.
I built UCX 1.17.0 from source with the configure command ../contrib/configure-release --prefix=/data/miniconda3/ --with-cuda=/usr/local/cuda-11.8 --enable-mt --without-go --without-java, and the build log looks like this:
[screenshot: UCX build log]
Then I tried to reinstall ucx-py, but I got this error:
[screenshot: ucx-py install error]
Have I overlooked any important steps? Please give me some suggestions.

@pentschev
Member

It seems like you're using conda; in that case, why are you attempting to install RAPIDS (cuML and UCX-Py included) from PyPI? A much easier choice is to install all RAPIDS packages with conda; you can have a look at the RAPIDS install selector tool for instructions.

This information is irrelevant if you use conda as I suggested above, but just for completeness: you specified --prefix=/data/miniconda3/, which is the path where conda is installed, but you would have to install into your conda environment, i.e., --prefix=$CONDA_PREFIX, which is also what the UCX-Py documentation says. Finally, the latest picture suggests you're trying to install ucx-py=0.30, which is very old; for RAPIDS 24.04 the matching version would be ucx-py=0.37, for RAPIDS 24.06 it is ucx-py=0.38, and so on.

@Cocoaxx
Author

Cocoaxx commented Jul 11, 2024

Our images are all tlinux, which is similar to CentOS, not Ubuntu, and RAPIDS 24.04 doesn't support it. We tried installing cuML/cuDF 23.04, which works well on a single GPU, but we get an error when using multiple GPUs. Is there any way to solve this problem?

@pentschev
Member

It's true that we don't provide system packages and docker images beyond RockyLinux and Ubuntu. However, with a conda install (which you do have, according to the conda deactivate && conda activate env-3.8.8 line you posted above) you should be able to install all RAPIDS packages, including UCX/UCX-Py. With that, you can first install all the RAPIDS dependencies in your conda environment and then use pip to install anything else you need that's potentially not available from conda-forge. What I'm suggesting is something like this:

conda create -n env-3.8.8 -c rapidsai -c conda-forge -c nvidia  cudf=24.06 cuml=24.06 cugraph=24.06 python=3.9 cuda-version=11.8
&& conda activate env-3.8.8
&& conda install ...
&& pip install ...

The above will be the lowest barrier for you, and cuml/cugraph will both automatically install UCX/UCX-Py as dependencies. I also suggest using RAPIDS 24.06, as 24.04 is already the old stable release and we cannot provide support for it.

If you still need to build things from source for a different reason, then the next step for you would be to check what I said previously:

This information is irrelevant if you use conda as I suggested above, but just for completeness: you specified --prefix=/data/miniconda3/, which is the path where conda is installed, but you would have to install into your conda environment, i.e., --prefix=$CONDA_PREFIX, which is also what the UCX-Py documentation says. Finally, the latest picture suggests you're trying to install ucx-py=0.30, which is very old; for RAPIDS 24.04 the matching version would be ucx-py=0.37, for RAPIDS 24.06 it is ucx-py=0.38, and so on.
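
As a sanity check on matching releases, something like the following (a simple sketch, assuming the packages are importable in the environment) prints the installed versions so they can be compared against the pairing above:

```python
# Print installed versions to confirm they pair up
# (e.g. cuml 24.06 should pair with ucx-py 0.38).
import cuml
import ucp  # ucx-py

print("cuml  :", cuml.__version__)
print("ucx-py:", ucp.__version__)
```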

@pentschev
Member

One more piece of information that I've now confirmed with others more experienced than me: RAPIDS 24.04 requires glibc>=2.17 and RAPIDS 24.06+ requires glibc>=2.28, see rapidsai/build-planning#23 for more information. Therefore, for the conda install I proposed above to work, you must ensure your system provides at least the minimum glibc version RAPIDS requires.
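
A quick way to check the system's glibc against those minimums (a small sketch using only the Python standard library):

```python
# Report the glibc version; RAPIDS 24.04 needs glibc>=2.17, 24.06+ needs glibc>=2.28.
import platform

libc, version = platform.libc_ver()
print(libc, version)  # e.g. ('glibc', '2.17') on CentOS 7
```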

@Cocoaxx
Author

Cocoaxx commented Jul 13, 2024

Thank you for your suggestion. Now I can install RAPIDS on my machine and use the multi-GPU DBSCAN algorithm. But when I used it on the cloud development machine, I encountered another error.
[screenshot: error output]
I suspect it's a problem with the graphics card driver version. The CUDA version of the development machine is 11.8 and the driver version is 450.156.00, but RAPIDS needs driver version 520.61.05 or newer, am I correct?
[screenshot]

@pentschev
Member

It's hard to say for sure, but CUDA 11.0 hasn't been supported since 2022; RAPIDS supports a minimum of CUDA 11.2, which requires driver 470.42.01 at minimum. To take advantage of CUDA 11.8 features you'll indeed need 520.61.05, although it will run on 470.42.01 thanks to CUDA Enhanced Compatibility, with the newer features being disabled.
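
One way to see which CUDA version the installed driver supports versus what the runtime expects (a sketch using pynvml and cupy, both of which appear in the environment listed above):

```python
# Compare the driver's supported CUDA version with the CUDA runtime in use.
import pynvml
import cupy

pynvml.nvmlInit()
print("Driver version :", pynvml.nvmlSystemGetDriverVersion())
print("Driver CUDA    :", cupy.cuda.runtime.driverGetVersion())   # e.g. 11040 -> CUDA 11.4
print("Runtime CUDA   :", cupy.cuda.runtime.runtimeGetVersion())  # e.g. 11080 -> CUDA 11.8
pynvml.nvmlShutdown()
```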

@Cocoaxx
Author

Cocoaxx commented Jul 19, 2024

Thank you for your reply. But when I run RAPIDS on 2x A10, CUDA 11.8 and driver 470.141.03, I get an error like this:

[1721380813.194737] [VM-192-150-centos:2662 :0] parser.c:2036 UCX WARN unused environment variables: UCX_WARN_UNUSED_ENV_VARS (maybe: UCX_WARN_UNUSED_ENV_VARS?); UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1721380813.194737] [VM-192-150-centos:2662 :0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
Dask CUDA Cluster created and client connected.
Sample data generated.
DBSCAN model defined.
VM-192-150-centos:2662:2857 [32750] NCCL INFO Bootstrap : Using eth0:9.130.192.150<0>
VM-192-150-centos:2662:2857 [32751] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
VM-192-150-centos:2662:2857 [32750] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
VM-192-150-centos:2662:2857 [0] NCCL INFO NET/Plugin: Using internal network plugin.

VM-192-150-centos:2871:2871 [32523] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

VM-192-150-centos:2871:2871 [1868963956] init.cc:1832 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

VM-192-150-centos:2866:2866 [32677] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

VM-192-150-centos:2866:2866 [1868963956] init.cc:1832 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
2024-07-19 17:20:24,037 - distributed.worker - WARNING - Run Failed
Function: _func_init_all
args: (b'2\x92#\x0f\xd4+K\\x9bj\x9cy\xff\xd7eD', b"\xbc'|\xd5\xff\x02b\xa4\x02\x00\xa6!\t\x82\xc0\x96\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x1b\x147\xef\x7f\x00\x00\x80\xcdv9\xef\x7f\x00\x00\xe0\xe6o\x82\xef\x7f\x00\x00\xf8\xe6o\x82\xef\x7f\x00\x00\x9f\xf9N\x00\x00\x00\x00\x00\x96\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00 (v9\xef\x7f\x00\x00pIv9\xef\x7f\x00", True, {'ucx://127.0.0.1:35807': {'rank': 1, 'port': 41683}, 'ucx://127.0.0.1:55299': {'rank': 0, 'port': 33303}}, False, 0)
kwargs: {'dask_worker': <Worker 'ucx://127.0.0.1:55299', name: 0, status: running, stored: 1, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/worker.py", line 3185, in run
result = await function(*args, **kwargs)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 450, in _func_init_all
_func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 515, in _func_init_nccl
n.init(nWorkers, uniqueId, wid)
File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'
2024-07-19 17:20:24,036 - distributed.worker - WARNING - Run Failed
Function: _func_init_all
args: (b'2\x92#\x0f\xd4+K\\x9bj\x9cy\xff\xd7eD', b"\xbc'|\xd5\xff\x02b\xa4\x02\x00\xa6!\t\x82\xc0\x96\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x1b\x147\xef\x7f\x00\x00\x80\xcdv9\xef\x7f\x00\x00\xe0\xe6o\x82\xef\x7f\x00\x00\xf8\xe6o\x82\xef\x7f\x00\x00\x9f\xf9N\x00\x00\x00\x00\x00\x96\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00 (v9\xef\x7f\x00\x00pIv9\xef\x7f\x00", True, {'ucx://127.0.0.1:35807': {'rank': 1, 'port': 41683}, 'ucx://127.0.0.1:55299': {'rank': 0, 'port': 33303}}, False, 0)
kwargs: {'dask_worker': <Worker 'ucx://127.0.0.1:35807', name: 1, status: running, stored: 1, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/worker.py", line 3185, in run
result = await function(*args, **kwargs)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 450, in _func_init_all
_func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 515, in _func_init_nccl
n.init(nWorkers, uniqueId, wid)
File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'
Traceback (most recent call last):
File "/workspace/user_code/nickname_seq_cluster/test.py", line 47, in
labels = dbscan.fit_predict(ddf)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/cuml/dask/cluster/dbscan.py", line 160, in fit_predict
self.fit(X, out_dtype)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
return func(*args, **kwargs)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/cuml/dask/cluster/dbscan.py", line 119, in fit
comms.init()
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 200, in init
self.client.run(
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/client.py", line 2991, in run
return self.sync(
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/utils.py", line 358, in sync
return sync(
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/utils.py", line 434, in sync
raise error
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/utils.py", line 408, in f
result = yield future
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/tornado/gen.py", line 766, in run
value = future.result()
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/distributed/client.py", line 2896, in _run
raise exc
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 450, in _func_init_all
_func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
File "/data/miniconda3/envs/env-3.9/lib/python3.9/site-packages/raft_dask/common/comms.py", line 515, in _func_init_nccl
n.init(nWorkers, uniqueId, wid)
File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'unhandled cuda error (run with NCCL_DEBUG=INFO for details)'
2024-07-19 17:20:24,207 - distributed.scheduler - ERROR - Removing worker 'ucx://127.0.0.1:55299' caused the cluster to lose scattered data, which can't be recovered: {'DataFrame-a9260fb655755d3cde1fc36cae8236b9'} (stimulus_id='worker-send-comm-fail-1721380824.2074068')

@pentschev
Member

The error seems to stem from:

VM-192-150-centos:2871:2871 [32523] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

@cjnolet @viclafargue would you be able to help here with the NCCL errors in RAFT? What is the minimum required driver version for it? The user is running CUDA 11.8 on 470.141.03 (CUDA 11.2); would an upgrade of the driver be required, or perhaps a downgrade to a CUDA 11.2 build for their system?
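
In the meantime, the NCCL error text itself suggests rerunning with NCCL_DEBUG=INFO. A hedged sketch of how that could be wired into the script (setting the variable before the local cluster starts, so the spawned workers inherit it; whether this surfaces the root cause is not guaranteed):

```python
# Enable verbose NCCL logging before creating the Dask-CUDA cluster.
import os
os.environ["NCCL_DEBUG"] = "INFO"  # as suggested by the NCCL error message

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(protocol="ucx")
client = Client(cluster)
# ... run the DBSCAN workload as before; NCCL details appear in the worker logs.
```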
