
RuntimeError: Error building extension 'cpu_adam' because /usr/bin/ld cannot find -lcurand, help! #5659

Closed
hekaijie123 opened this issue Jun 14, 2024 · 7 comments

@hekaijie123

python -c 'import deepspeed; deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'
[2024-06-14 14:24:07,747] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
Using /home/jxlab03/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /home/jxlab03/.cache/torch_extensions/py310_cu118/cpu_adam...
Emitting ninja build file /home/jxlab03/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/TH -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/THC -isystem /home/jxlab03/anaconda3/envs/minicpm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512 -D__ENABLE_CUDA_ -DBF16_AVAILABLE -c /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/3] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/TH -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/THC -isystem /home/jxlab03/anaconda3/envs/minicpm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512 -D__ENABLE_CUDA_ -DBF16_AVAILABLE -c /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[3/3] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 508, in load
return self.jit_load(verbose)
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 555, in jit_load
op_module = load(name=self.name,
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'

However, when I run "ldconfig -p | grep libcurand" in a terminal, I can see libcurand.so:
libcurand.so.10 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so.10
libcurand.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so

The torch CUDA version and the nvcc version match; both are 11.8.

So I don't know why the ninja build cannot find -lcurand.
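
A possible explanation, offered as a sketch rather than a confirmed diagnosis: `ldconfig -p` lists the cache used by the runtime loader, whereas `-lcurand` is resolved at link time from the `-L` directories, the directories in `LIBRARY_PATH` (for gcc), and the linker's built-in defaults. The failing link command above passes only the torch `lib` directory with `-L`. The hypothetical helper below (`find_link_time_library` is not part of DeepSpeed or PyTorch) reports which of the candidate directories actually contain `libcurand.so`; the `/usr/local/cuda/lib64` path is an assumption taken from the compile lines in the log.

```python
import os
from pathlib import Path


def find_link_time_library(name, extra_dirs=(), env=None):
    """Return the directories from LIBRARY_PATH and extra_dirs that contain lib<name>.so.

    Hypothetical helper for illustration; it does not inspect the linker's
    built-in default directories, and empty LIBRARY_PATH entries are skipped.
    """
    env = os.environ if env is None else env
    candidates = [d for d in env.get("LIBRARY_PATH", "").split(os.pathsep) if d]
    candidates += list(extra_dirs)
    target = f"lib{name}.so"
    return [d for d in candidates if (Path(d) / target).exists()]


if __name__ == "__main__":
    # Assumed CUDA location; adjust to the local installation.
    hits = find_link_time_library("curand", extra_dirs=["/usr/local/cuda/lib64"])
    print(hits or "libcurand.so not found in LIBRARY_PATH or the extra directories")
```

An empty result for `LIBRARY_PATH` alone would be consistent with the error above, even though `ldconfig -p` lists the library.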

@Mr-lonely0

Same problem.
Have you solved it?

@Mucalinda2436

Same problem! Why?

@fly-dragon211

Same problem:

Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/data/miniconda3/envs/env-novelai/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so 
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/data/miniconda3/envs/env-novelai/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/data/miniconda3/envs/env-novelai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
    subprocess.run(
  File "/data/miniconda3/envs/env-novelai/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

@fly-dragon211

I solved it by following the method described here: #3929 (comment)

cd /home/asdf/.local/lib/python3.10/site-packages/torch/lib
ln -s /usr/local/cuda/lib64/libcurand.so .

There is no lib64 under /home/enwei/anaconda3/envs/llama, so I copied everything in lib to lib64, and that solved the problem for me.
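
The symlink workaround presumably works because the failing link command already passes the torch `lib` directory with `-L`, so a `libcurand.so` placed there becomes visible to the linker. A minimal Python sketch of the same step follows; `link_library_into` is a hypothetical helper, and the paths in the usage comment are the ones quoted above, which will differ per machine.

```python
from pathlib import Path


def link_library_into(src, dest_dir):
    """Create dest_dir/<basename of src> as a symlink to src.

    Hypothetical helper mirroring `ln -s <src> .` run inside dest_dir.
    An existing file or link at the destination is left untouched.
    """
    src = Path(src)
    dest = Path(dest_dir) / src.name
    if not src.exists():
        raise FileNotFoundError(src)
    if dest.exists() or dest.is_symlink():
        return dest
    dest.symlink_to(src)
    return dest


# Example call with the paths from the comment above (adjust to your machine):
# link_library_into("/usr/local/cuda/lib64/libcurand.so",
#                   "/home/asdf/.local/lib/python3.10/site-packages/torch/lib")
```

Note that this modifies the installed torch package directory, so the link may need to be recreated after torch is reinstalled.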

@lekurile
Contributor

lekurile commented Jul 17, 2024

Hi @hekaijie123,

Can you please share your environment variables, specifically LIBRARY_PATH and LD_LIBRARY_PATH?

We've found that explicitly setting LIBRARY_PATH to point to lib64 resolves this exact issue, e.g.:

export CUDA_HOME="/usr/local/cuda-12.5"
export LIBRARY_PATH="/usr/local/cuda-12.5/lib64:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.5/lib64:$LD_LIBRARY_PATH"
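
For anyone launching the build from Python rather than a shell, the same settings can be sketched as below. This is an illustration under assumptions, not DeepSpeed's own mechanism: `with_cuda_lib_paths` and `rebuild_cpu_adam` are hypothetical helpers, and the CUDA path must match the local installation. Unlike the shell lines, the sketch avoids leaving a trailing separator when a variable was previously unset.

```python
import os
import subprocess
import sys


def with_cuda_lib_paths(cuda_home, env=None):
    """Return a copy of env with CUDA_HOME set and <cuda_home>/lib64 prepended
    to LIBRARY_PATH (link time) and LD_LIBRARY_PATH (run time)."""
    env = dict(os.environ if env is None else env)
    lib64 = os.path.join(cuda_home, "lib64")
    env["CUDA_HOME"] = cuda_home
    for var in ("LIBRARY_PATH", "LD_LIBRARY_PATH"):
        existing = env.get(var, "")
        # Only add a separator when there is an existing value to keep.
        env[var] = lib64 + (os.pathsep + existing if existing else "")
    return env


def rebuild_cpu_adam(cuda_home):
    """Re-run the JIT build command from the original report in the adjusted environment."""
    cmd = [sys.executable, "-c",
           "import deepspeed; deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()"]
    return subprocess.run(cmd, env=with_cuda_lib_paths(cuda_home), check=False)
```

Calling `rebuild_cpu_adam("/usr/local/cuda-12.5")` would rerun the build with the variables applied only to that subprocess, leaving the parent environment unchanged.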

@AlanBlanchet

Setting LIBRARY_PATH worked!
Thanks!

@lekurile
Contributor

lekurile commented Aug 1, 2024

Closing issue since #5780 has been merged.
