[BUG] cannot find -lcurand and -lcudart #3929
Comments
Hi @GasolSun36 - if you don't see either of the CUDA .so files in your CUDA install path, my guess is that something went wrong during the CUDA install. Unfortunately, there is more to these libraries than just the .so files, so you probably want to re-install CUDA and make sure it installs correctly with both curand and cudart. Can you try re-installing CUDA and let us know if that fixes your issue? |
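One way to run the check suggested above is a small script that lists the curand and cudart shared libraries under a CUDA install. This is only a sketch: the function name `find_cuda_libs` is made up for illustration, and the fallback path `/usr/local/cuda` is an assumption about a typical Linux install, not something stated in this thread.

```python
import os
from pathlib import Path


def find_cuda_libs(cuda_home, names=("curand", "cudart")):
    """Return {name: [matching files]} found in lib/ and lib64/ under cuda_home."""
    root = Path(cuda_home)
    found = {name: [] for name in names}
    for sub in ("lib", "lib64"):
        directory = root / sub
        if not directory.is_dir():
            continue  # this install has no such directory
        for name in names:
            # Matches libcurand.so as well as versioned names like libcurand.so.10
            found[name].extend(sorted(str(p) for p in directory.glob(f"lib{name}.so*")))
    return found


if __name__ == "__main__":
    # Assumption: CUDA_HOME is set, otherwise fall back to a common default path.
    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    for name, paths in find_cuda_libs(cuda_home).items():
        print(name, "->", paths if paths else "NOT FOUND")
```

If one library shows up only under `lib64` while the build looks in `lib` (or the other way round), the install itself is probably fine and the problem is the search path.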
Hi @GasolSun36 - were you able to test the re-install of cuda? |
Hi @loadams , I have the same issue. When I run
I get
I checked the path I am not sure what is the problem with my cuda installation. Can you help me? Thanks. Best |
I find that there is no |
Thanks a lot! That sounds like a solution. However, I created a new environment and found that the reason for this was that I was using conda to download version 1.13.1 of pytorch, and after using pip to download it, the issue was resolved. |
Could we re-open this? I have the same issue on SageMaker, copying the DLLs doesn't seem like a fix but a workaround. Couldn't we include the |
Maybe installing PyTorch with pip could solve your problem; it worked for me. |
It's not related to pip, it's just that conda installs in the Temporary solution for anyone reading this, set the environment variables
|
Hi @maximegmd - it would be interesting to know why SageMaker/the OS is using the lib folder and not lib64, but thanks for the PR, we will get that merged, so no need to re-open this issue, right? |
Thank you @GasolSun36, this finally solved the issue! |
Hey @pacman100 - do you need any other support on this issue or are things working now? |
This worked for me too. I actually just made a symbolic link for that libcurand file, same idea. Like this:
|
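The commenter's exact command is not preserved here. As a rough sketch of the same idea (not the original command), the snippet below symlinks an existing library file into the directory the linker searches. The helper name `link_library` and every path in the usage note are hypothetical placeholders.

```python
from pathlib import Path


def link_library(source, target_dir, link_name=None):
    """Symlink `source` into `target_dir`; return the path of the link.

    An existing file or link at the destination is left untouched.
    """
    source = Path(source)
    if not source.exists():
        raise FileNotFoundError(source)
    link = Path(target_dir) / (link_name or source.name)
    if link.exists() or link.is_symlink():
        return link  # do not overwrite whatever is already there
    link.symlink_to(source.absolute())
    return link
```

Usage would look something like `link_library("<cuda install>/lib64/libcurand.so", "<env>/lib")`, with both paths replaced by the ones on your own machine. Like copying the file, this is a workaround rather than a fix for the search path.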
It worked for me, thanks. |
But why does it use lib instead of lib64? |
Look at the command that was run right before the error occurred. I am running DeepSpeed using the Hugging Face Accelerate framework. For me it looks like the following
So clearly the linking directory is
Therefore, if you symlink the library into this directory, the error should go away. So I find it in my CUDA installation and then |
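The diagnosis described above (read the failing link command, see which `-L` directories it searches, and check which `-l` libraries are absent from them) can be sketched in a few lines. The function name `missing_libs` is made up for illustration. The sketch only inspects the `-L` directories given on the command line, whereas the real linker also searches its default system directories, so treat the result as a hint rather than a verdict.

```python
import shlex
from pathlib import Path


def missing_libs(link_command):
    """Return (search_dirs, missing) for a c++/ld style link command.

    `missing` lists every -l<name> for which neither lib<name>.so nor
    lib<name>.a exists in any of the -L directories.
    """
    tokens = shlex.split(link_command)
    search_dirs = [t[2:] for t in tokens if t.startswith("-L") and len(t) > 2]
    libs = [t[2:] for t in tokens if t.startswith("-l") and len(t) > 2]
    missing = []
    for lib in libs:
        candidates = (f"lib{lib}.so", f"lib{lib}.a")
        present = any((Path(d) / c).exists() for d in search_dirs for c in candidates)
        if not present:
            missing.append(lib)
    return search_dirs, missing
```

Passing in the full `c++ ... -o cpu_adam.so` line from the build log returns the directories that were searched and the libraries not found in them, which shows where a symlink would need to go.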
Describe the bug
The error is
FAILED: cpu_adam.so
c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/xuchengjin/anaconda3/envs/test/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/xuchengjin/anaconda3/envs/test/lib64 -lcudart -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
/usr/bin/ld: cannot find -lcudart
When I look in /torch/lib, there really is no libcurand.so or libcudart.so in there.
Is this normal? Or is there something wrong with my CUDA installation? Can I copy these two files from someone who already has them into my directory?
To Reproduce
Steps to reproduce the behavior:
I'm running stanford_alpaca train.py, and using
to start the training.
The "default_offload_opt_param.json" is:
Expected behavior
ds_report output
System info (please complete the following information):
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Additional context
Add any other context about the problem here.