Multi-GPU DBSCAN segmentation fault #5961
Thanks for the issue, @Cocoaxx. The permissions issue makes me believe this might have to do with UCX on the system, since the first warning in the trace is suspicious. Maybe someone like @pentschev would know whether I'm looking in the right place to triage this issue.
The warning saying
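UCX reads its configuration from `UCX_*` environment variables, and the UCX warning in the reported log flags some of them as unused. As a minimal sketch for checking which of these variables a process actually sees (the helper name `ucx_env_vars` is hypothetical, not part of UCX or UCX-Py), one could run:

```python
import os


def ucx_env_vars(environ=None):
    """Return the UCX_* variables visible to this process, sorted by name.

    `ucx_env_vars` is a hypothetical helper; it only inspects the environment
    and does not talk to UCX itself.
    """
    env = os.environ if environ is None else environ
    # Keep only variables that UCX would treat as its own configuration.
    return {k: v for k, v in sorted(env.items()) if k.startswith("UCX_")}


if __name__ == "__main__":
    for name, value in ucx_env_vars().items():
        print(f"{name}={value}")
```

Running this in the same shell or container that launches the Dask workers may help spot stray settings such as `UCX_MEMTYPE_CACHE` left over from an older setup.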
Thank you for your quick reply, I will give it a try. By the way, I have to use Python 3.8 and RAPIDS 23.04, but it looks like UCX-Py requires Python >= 3.9. Is that right?
It seems like you're using conda; in that case, why are you installing RAPIDS (including both cuML and UCX-Py) from PyPI? A much easier option is to install all RAPIDS packages with conda; the RAPIDS install selector tool has instructions. This information is irrelevant if you use conda as I suggested above, but just for completeness: you specified
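Whether an environment mixes conda with PyPI wheels can be checked from inside Python. The sketch below is a heuristic under stated assumptions: it treats a `conda-meta` directory under `sys.prefix` as the sign of a conda environment, and it looks for the `-cu11` wheel names that appear in the reporter's `pip list` (`cudf-cu11`, `cuml-cu11`, `ucx-py-cu11`). Both helper names are hypothetical.

```python
import os
import sys
from importlib import metadata


def running_in_conda(prefix=None):
    """Heuristic: conda environments keep their package records in conda-meta/."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


def pip_rapids_wheels(names=("cudf-cu11", "cuml-cu11", "ucx-py-cu11")):
    """Return {distribution: version} for the given PyPI wheel names that are installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed as a wheel under this name
    return found


if __name__ == "__main__":
    print("conda environment:", running_in_conda())
    print("RAPIDS wheels from PyPI:", pip_rapids_wheels())
```

If the first line prints `True` and the second lists `-cu11` wheels, RAPIDS came from PyPI into a conda environment, which is the mix the comment above suggests avoiding.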
Our images are all tlinux, which is similar to CentOS rather than Ubuntu, and RAPIDS 24.04 doesn't support it. We installed cuML and cuDF 23.04, which work well on a single GPU, but we get errors when using multiple GPUs. Is there any way to solve this problem?
It's true that we don't provide system packages or Docker images for distributions other than Rocky Linux and Ubuntu. However, with a conda install (which you do have, according to the
The above is the lowest-barrier option for you. If you still need to build things from source for a different reason, the next step is to check what I said previously:
One more piece of information, which I've now confirmed with others more experienced than me: RAPIDS 24.04 requires
It's hard to say for sure, but CUDA 11.0 hasn't been supported since 2022; RAPIDS supports a minimum of CUDA 11.2, which requires driver 470.42.01 or newer. To take advantage of CUDA 11.8 features you'll indeed need driver 520.61.05, although CUDA 11.8 will still run on 470.42.01 thanks to CUDA Enhanced Compatibility, with the newer features disabled.
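The driver thresholds in the comment above can be encoded as a quick check. This is a minimal sketch that uses only the two versions quoted there (470.42.01 for CUDA 11.2 and 520.61.05 for CUDA 11.8); the table and the `driver_status` name are illustrative assumptions, not an official compatibility matrix.

```python
# Minimum Linux driver versions as quoted in the thread (assumption: only these two).
MIN_DRIVER = {
    "11.2": "470.42.01",
    "11.8": "520.61.05",
}


def parse_driver(version):
    """Turn '470.141.03' into (470, 141, 3) so tuples compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_status(driver, cuda="11.8"):
    """Classify a driver for a target CUDA version.

    'full'        -> meets the minimum for that CUDA version
    'compat'      -> below it but >= the CUDA 11.2 floor, so it runs with newer
                     features disabled (per the comment above)
    'unsupported' -> below the CUDA 11.2 floor
    """
    d = parse_driver(driver)
    if d >= parse_driver(MIN_DRIVER[cuda]):
        return "full"
    if d >= parse_driver(MIN_DRIVER["11.2"]):
        return "compat"
    return "unsupported"
```

For the reporter's driver, `driver_status("470.141.03")` gives `"compat"`. As the next comment shows, "runs in compatibility mode" does not guarantee that every library (NCCL in particular) accepts that combination.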
Thank you for your reply. But when I run RAPIDS on 2x A10 with CUDA 11.8 and driver 470.141.03, I get errors like this:
[1721380813.194737] [VM-192-150-centos:2662 :0] parser.c:2036 UCX WARN unused environment variables: UCX_WARN_UNUSED_ENV_VARS (maybe: UCX_WARN_UNUSED_ENV_VARS?); UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
VM-192-150-centos:2871:2871 [32523] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
VM-192-150-centos:2871:2871 [1868963956] init.cc:1832 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
VM-192-150-centos:2866:2866 [32677] misc/cudawrap.cc:182 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
VM-192-150-centos:2866:2866 [1868963956] init.cc:1832 NCCL WARN Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
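To see which CUDA versions the driver and the runtime each report, one option is the `cuda-python` package that already appears in the environment (`cuda-python 11.8.2`). The sketch below assumes its `cuda.cudart` module, whose `cudaDriverGetVersion()` and `cudaRuntimeGetVersion()` return an `(error, version)` pair with the version encoded as `1000 * major + 10 * minor`. The helper names are hypothetical, and the query simply returns `None` when `cuda-python` is not importable.

```python
def decode_cuda_version(v):
    """CUDA encodes versions as 1000 * major + 10 * minor, e.g. 11080 -> (11, 8)."""
    return v // 1000, (v % 1000) // 10


def query_versions():
    """Return ((driver_major, driver_minor), (runtime_major, runtime_minor)) or None."""
    try:
        from cuda import cudart  # provided by the cuda-python package
    except ImportError:
        return None
    err, driver = cudart.cudaDriverGetVersion()
    if err != cudart.cudaError_t.cudaSuccess:
        return None
    err, runtime = cudart.cudaRuntimeGetVersion()
    if err != cudart.cudaError_t.cudaSuccess:
        return None
    return decode_cuda_version(driver), decode_cuda_version(runtime)


if __name__ == "__main__":
    print(query_versions())
```

If the driver tuple is older than the runtime tuple, that gap is presumably what NCCL's "CUDA driver version is insufficient for CUDA runtime version" message is reacting to.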
The error seems to stem from:
@cjnolet @viclafargue, would you be able to help here with the NCCL errors in RAFT? What is the minimum required driver version for it? The user is running CUDA 11.8 on driver 470.141.03 (CUDA 11.2); would a driver upgrade be required, or perhaps a downgrade to a CUDA 11.2 build for their system?
Describe the bug
When I try to use multi-GPU DBSCAN, I get `Segmentation fault: invalid permissions for mapped object at address 0x7f0c8e0007c0`.
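The report does not include the code that triggers the crash. For orientation only, a minimal multi-GPU DBSCAN run with cuML's Dask API might look like the sketch below. It assumes `cuml.dask.cluster.DBSCAN` driven by a `dask_cuda.LocalCUDACluster` (one worker per visible GPU), synthetic CuPy data, and arbitrary `eps`/`min_samples` values; `run_multi_gpu_dbscan` is a hypothetical helper, not the reporter's code.

```python
def run_multi_gpu_dbscan(n_samples=10_000, n_features=8, eps=0.5,
                         min_samples=5, protocol="tcp"):
    """Fit cuML's multi-GPU DBSCAN on random data across all visible GPUs."""
    # Imports are local so this file can be imported on machines without RAPIDS.
    import cupy as cp
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from cuml.dask.cluster import DBSCAN

    # One Dask worker per GPU; protocol="ucx" routes Dask traffic over UCX.
    with LocalCUDACluster(protocol=protocol) as cluster, Client(cluster) as client:
        X = cp.random.random((n_samples, n_features), dtype=cp.float32)
        model = DBSCAN(client=client, eps=eps, min_samples=min_samples)
        return model.fit_predict(X)
```

Calling `run_multi_gpu_dbscan(protocol="tcp")` and then `run_multi_gpu_dbscan(protocol="ucx")` may help narrow down whether the segfault follows UCX, as suspected in the first comment.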
Steps/Code to reproduce bug
Environment details (please complete the following information):
```
Package Version
-arkupSafe 2.0.1
-enus-api-base 1.3.14
aiohttp 3.9.5
aiosignal 1.3.1
anyio 3.7.1
argon2-cffi 20.1.0
asgiref 3.8.1
async-generator 1.10
async-timeout 4.0.3
attrs 20.3.0
autovizwidget 0.21.0
Babel 2.9.0
backcall 0.2.0
bleach 3.3.0
cachetools 5.3.3
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
click 8.1.7
cloudpickle 3.0.0
comm 0.2.2
contourpy 1.1.1
cos-python-sdk-v5 1.9.30
coscmd 1.8.5.37
crcmod 1.7
cubinlinker-cu11 0.3.0.post1
cuda-python 11.8.2
cudf-cu11 23.4.1
cugraph-cu11 23.4.1
cuml-cu11 23.4.1
cupy-cuda11x 11.6.0
cycler 0.12.1
Cython 3.0.8
dask 2023.3.2
dask-cuda 23.4.0
dask-cudf-cu11 23.4.1
DateTime 5.5
decorator 5.0.7
defusedxml 0.7.1
deprecation 2.1.0
dill 0.3.8
distributed 2023.3.2.1
dulwich 0.22.1
entrypoints 0.3
exceptiongroup 1.2.1
fastapi 0.70.0
fastrlock 0.8.2
filelock 3.9.0
fonttools 4.49.0
frozenlist 1.4.1
fsspec 2024.2.0
fuzzywuzzy 0.18.0
h11 0.14.0
hdijupyterutils 0.19.1
huggingface-hub 0.21.3
idna 2.10
importlib-metadata 7.0.1
importlib-resources 6.1.2
ipykernel 5.5.3
ipython 7.22.0
ipython-genutils 0.2.0
ipywidgets 8.1.3
jedi 0.18.0
Jinja2 2.11.3
joblib 1.3.2
json5 0.9.5
jsonschema 3.2.0
jupyter-client 6.2.0
jupyter-core 4.7.1
jupyter-packaging 0.9.1
jupyter-server 1.6.1
jupyterlab 3.0.14
jupyterlab-pygments 0.1.2
jupyterlab-server 2.4.0
jupyterlab-widgets 3.0.11
kiwisolver 1.4.5
llvmlite 0.39.1
locket 1.0.0
markdown-it-py 3.0.0
MarkupSafe 2.0.1
matplotlib 3.7.5
mdurl 0.1.2
mistune 0.8.4
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
nbclassic 0.2.7
nbclient 0.5.3
nbconvert 6.0.7
nbformat 5.1.3
nest-asyncio 1.5.1
networkx 3.0
nltk 3.8.1
notebook 6.3.0
numba 0.56.4
numpy 1.23.5
nvtx 0.2.10
packaging 20.9
pandas 1.5.3
pandocfilters 1.4.3
parso 0.8.2
partd 1.4.1
pexpect 4.8.0
pickleshare 0.7.5
pillow 10.2.0
pip 21.0.1
pip-magic 0.2.3
plotly 5.22.0
prettytable 3.10.0
prometheus-client 0.10.1
prompt-toolkit 3.0.18
protobuf 4.21.12
psutil 5.9.8
ptxcompiler-cu11 0.7.0.post1
ptyprocess 0.7.0
pyarrow 10.0.1
pycparser 2.20
pycryptodome 3.20.0
pydantic 1.10.17
pygments 2.18.0
pylibcugraph-cu11 23.4.1
pylibraft-cu11 23.4.1
pynvml 11.4.1
pyparsing 2.4.7
pyrsistent 0.17.3
python-dateutil 2.8.1
pytz 2021.1
PyYAML 6.0.1
pyzmq 22.0.3
raft-dask-cu11 23.4.1
regex 2023.12.25
requests 2.25.1
rich 13.7.1
rmm-cu11 23.4.1
safetensors 0.4.2
scikit-learn 1.3.2
scipy 1.10.1
seaborn 0.13.2
Send2Trash 1.5.0
sentence-transformers 2.2.2
sentencepiece 0.2.0
setuptools 52.0.0.post20210125
six 1.15.0
sniffio 1.2.0
sortedcontainers 2.4.0
sparkmagic 0.19.1.12
starlette 0.16.0
supervisor 4.2.5
sympy 1.12
tblib 3.0.0
tenacity 8.5.0
terminado 0.9.4
testpath 0.4.4
threadpoolctl 3.3.0
tokenizers 0.15.2
tomlkit 0.7.0
toolz 0.12.1
torch 2.1.0+cu118
torchaudio 2.1.0+cu118
torchvision 0.16.0+cu118
tornado 6.1
tqdm 4.66.2
traitlets 5.0.5
transformers 4.38.2
treelite 3.2.0
treelite-runtime 3.2.0
triton 2.1.0
typing-extensions 4.10.0
ucx-py-cu11 0.31.1
urllib3 1.26.4
urwid 2.6.15
uvicorn 0.15.0
venus-api-all 1.3.21
venus-api-base 1.3.21
venus-boot 1.3.21
venus-extension 0.8.1
venus-flow 1.3.21
venus-flow-operator 1.3.21
venus-mdfs 0.1.0
venus-ml 1.3.21
venus-sdk 1.3.21
venus-tools 1.3.21
wcwidth 0.2.5
webencodings 0.5.1
wheel 0.36.2
widgetsnbextension 4.0.11
xmltodict 0.13.0
yarl 1.9.4
zict 3.0.0
zipp 3.17.0
zope.interface 6.4.post2
```