supports running on CPU for GGML_USE_CUBLAS=ON build #3946
We might want to merge something like this PR, so that 3rd-party projects have an easier way to support optional CPU-only runs. Though I'm not sure this is the best way to do it.
Some alternative ideas considered when prototyping this pull request (assuming
Most of this will become obsolete after llama.cpp is adapted to use ggml-backend. After that, the way this will be implemented is by making
I cleaned it up a bit and think it should be relatively easy to migrate to ggml_backend.
if (ggml_cublas_loaded()) {
    return ggml_cuda_host_malloc(n);
} else {
    return malloc(n);
}
Should we move the ggml_cublas_loaded() checks inside the ggml_cuda_host_malloc() and ggml_cuda_host_free() calls?
I can do that, but I feel it might be better to keep it explicit. This way, downstream code has a chance to tell whether the memory was actually allocated through CUDA, which is likely to involve certain memory alignment requirements.
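To illustrate the "explicit" option, here is a minimal sketch in which the caller records which allocator produced a buffer, so it can later be freed (or inspected) accordingly. Everything prefixed `stub_` or `demo_`, as well as `host_buffer`, is hypothetical and only stands in for ggml_cublas_loaded(), ggml_cuda_host_malloc() and ggml_cuda_host_free(); this is not the code in the PR.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for the CUDA-side functions discussed in this thread.
// In a real build they would come from ggml-cuda.h; here they only wrap malloc/free.
static bool demo_cublas_loaded = false;
static bool   stub_cublas_loaded()            { return demo_cublas_loaded; }
static void * stub_cuda_host_malloc(size_t n) { return std::malloc(n); }
static void   stub_cuda_host_free(void * p)   { std::free(p); }

// A buffer that remembers which allocator produced it (hypothetical type).
struct host_buffer {
    void * data     = nullptr;
    size_t size     = 0;
    bool   via_cuda = false; // true if allocated through the CUDA host allocator
};

// The check stays at the call site, so the caller knows which path was taken.
static host_buffer host_buffer_alloc(size_t n) {
    host_buffer buf;
    buf.via_cuda = stub_cublas_loaded();
    buf.data     = buf.via_cuda ? stub_cuda_host_malloc(n) : std::malloc(n);
    buf.size     = buf.data ? n : 0;
    return buf;
}

// Free with the allocator that matches the one used at allocation time.
static void host_buffer_free(host_buffer & buf) {
    if (buf.via_cuda) {
        stub_cuda_host_free(buf.data);
    } else {
        std::free(buf.data);
    }
    buf = host_buffer{};
}
```

Storing the flag at allocation time also avoids a mismatch if the loaded state were ever to change between the malloc and the free.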
I think this can serve as a temp solution, until we start using the new backend and refactor this code.
One question: if I had a machine with a CUDA device, but still wanted to force CPU-only computation, what would be my option? Set CUDA_VISIBLE_DEVICES=-1? Note that simply setting -ngl 0 would not work, because ggml will keep moving data to the device and do some of the computations there instead of on the CPU.
@@ -18,6 +18,8 @@ extern "C" {
#define GGML_CUDA_MAX_DEVICES 16

GGML_API void ggml_init_cublas(void);
GGML_API bool ggml_cublas_loaded(void);
Add a comment that we differentiate between the "initialized" and "loaded" states. The former means we have called ggml_init_cublas(), but it's not guaranteed that a CUDA device was available, in which case ggml_cublas_loaded() is false.
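The distinction between the two states can be sketched as two separate flags. This is only an illustration under stated assumptions: all `sketch_` names are hypothetical, and the device probe is a stub rather than a real CUDA runtime query.

```cpp
// Hypothetical device probe; a real build would ask the CUDA runtime instead.
static int sketch_device_count = 0;
static int sketch_probe_devices() { return sketch_device_count; }

static bool sketch_initialized = false; // init has been called
static bool sketch_loaded      = false; // init found at least one usable device

// Stand-in for ggml_init_cublas(): always marks the backend as initialized,
// but marks it as loaded only when a device is available.
static void sketch_init_cublas() {
    if (sketch_initialized) {
        return; // initialize once
    }
    sketch_initialized = true;
    sketch_loaded      = sketch_probe_devices() > 0;
}

// Stand-ins for the two state queries.
static bool sketch_is_initialized() { return sketch_initialized; }
static bool sketch_is_loaded()      { return sketch_loaded; }
```

With no device present, sketch_is_initialized() is true after the init call while sketch_is_loaded() stays false, which is the case the requested comment should describe.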
done
I experimented with
It's possible that the behaviour with
That's not the case, CUDA is still used for large matrix multiplication regardless of the value of
Kindly ping @cebtenzzre
…v#3946) * protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build * doc: add comments to ggml_cublas_loaded() * fix defined(...)
Due to newer changes that add a link against libcuda, the fix no longer works. A CUDA build now generates the following error message when run in a non-CUDA environment:
#4606 seems to be the culprit. Looking into a workaround.
Should work on a machine without the CUDA runtime, but with model.n_gpu_layers = 0. The current behavior in master is to throw the following error on a non-CUDA machine when GGML_USE_CUBLAS=ON:
master
CPU
this PR
CPU
CPU but requesting ngl > 0
CUDA