llama : offload to RPC in addition to other backends #7640
Conversation
The way this is supposed to work is that backends need to check the buffer type of the tensors to determine if they can perform the copy, and return false if they cannot.
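For illustration, a minimal sketch of that idea for a hypothetical "mydev" backend (the callback name and signature are paraphrased from the ggml-backend buffer interface and may differ between versions): the copy callback checks the buffer type of the source tensor, handles the cases it understands, and returns false otherwise so the caller can fall back to a copy staged through host memory.

#include "ggml.h"
#include "ggml-backend.h"

// Sketch only (hypothetical "mydev" backend): decide whether this buffer can
// perform the copy based on where the source tensor lives.
static bool mydev_buffer_cpy_tensor(ggml_backend_buffer_t buffer,
                                    const struct ggml_tensor * src,
                                    struct ggml_tensor * dst) {
    (void) dst;
    if (ggml_backend_buffer_is_host(src->buffer)) {
        // host -> device: always supported
        // ... upload src->data into dst's device allocation ...
        return true;
    }
    if (ggml_backend_buffer_get_type(src->buffer) == ggml_backend_buffer_get_type(buffer)) {
        // device -> device within the same buffer type
        // ... issue a device-to-device copy ...
        return true;
    }
    // src is in a buffer type this backend does not understand (e.g. an RPC
    // buffer) - report that the copy was not performed
    return false;
}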
It crashes in …
The …
rgerganov#1 should fix it.
@slaren thanks, I have verified your fix and merged it |
…uffer
- always initialize views in the view_src buffer
- add RPC backend to Makefile build
- add endpoint to all RPC object names
Co-authored-by: slaren <slarengh@gmail.com>
My changes have been reviewed and the CI run was successful.
I don't have any privileges in this project and my code is being reviewed like everybody else's. You'd better have some arguments when saying something about the quality of my changes.
When someone is proposing a change, it is their responsibility to address review comments and make sure that the change can be applied on current …
My change has been reviewed and approved by a core maintainer and the CI run was successful.
void ggml_backend_view_init(struct ggml_tensor * tensor) {
    GGML_ASSERT(tensor->buffer == NULL);
    GGML_ASSERT(tensor->view_src != NULL);
    GGML_ASSERT(tensor->view_src->buffer != NULL);
    GGML_ASSERT(tensor->view_src->data != NULL);

-   tensor->buffer = buffer;
+   tensor->buffer = tensor->view_src->buffer;
    tensor->data = (char *)tensor->view_src->data + tensor->view_offs;
-   ggml_backend_buffer_init_tensor(buffer, tensor);
+   ggml_backend_buffer_init_tensor(tensor->buffer, tensor);
}
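With this change the explicit buffer argument is gone and a view is always initialized in the buffer of its view_src. A minimal usage sketch (ctx, base and the view tensor are hypothetical; base is assumed to already live in a backend buffer):

#include "ggml.h"
#include "ggml-backend.h"

// Sketch: create a view of an already-allocated tensor and initialize its
// backend buffer/data. Previously ggml_backend_view_init() also took the
// destination buffer as a parameter; now it is taken from view->view_src.
static void init_view_example(struct ggml_context * ctx, struct ggml_tensor * base) {
    struct ggml_tensor * view = ggml_view_1d(ctx, base, ggml_nelements(base), 0);
    ggml_backend_view_init(view);
}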
This breaks the latest SYCL support. Could I know whether all the other backends assume that the buffer is already bound to the right tensor? Of course, I know this is an issue with SYCL itself; we are maintaining SYCL in our spare time and are still begging for official support from the company :) I will look into it and try to fix it soon.
@slaren @rgerganov for awareness
It's the same issue that was fixed for the Vulkan backend in #7806; there are more details there about why this happens and why the change was necessary. The best way to fix this for the SYCL backend would be to remove the extras entirely, in the same way they were removed from the CUDA backend.
we are maintaining SYCL in our spare time and are still begging for official support from the company
Having a dedicated machine that runs the CI with the SYCL backend would be very helpful.
…g#7640)" (ggml-org#7981) This reverts commit bde7cd3.
I see that you reverted this merge. Does it mean that it will not work with other backends? I'm using Vulkan instead of CUDA (for many reasons), and actually I can see that …
I also encountered this problem. I compiled the RPC server with the OpenCL backend for Android, but this is what was output when I ran it. The standalone llama-cli can use the OpenCL backend, though.
@zhouwg ok, I have deleted the comment.
This patch adds support for offloading layers to RPC servers in addition to other non-RPC backends. For example, if you build with -DLLAMA_CUDA=ON -DLLAMA_RPC=ON, then you can offload to the local GPU and remote server(s):

$ bin/main -m ../models/ggml-model-f16.gguf -p "Hello, my name is" -n 64 -ngl 99 -s 1236 --rpc localhost:50052
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0,31 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:   CPU buffer size =  125,00 MiB
llm_load_tensors: CUDA0 buffer size = 1008,19 MiB
llm_load_tensors:   RPC buffer size =  965,16 MiB
..........................................................................................
I have tried to follow the existing patterns in llama.cpp and introduced device numbers for RPC servers, which always come last.

When copying tensors, we need to handle the case where src and dst are not on the same backend. For CUDA I had to build with -DLLAMA_CUDA_NO_PEER_COPY=ON to make it work.
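As a reference for that cross-backend case, here is a minimal sketch of a copy between tensors owned by different backends, staged through host memory with the public ggml-backend API. This is not the code from this PR, and the helper name is made up.

#include <stdlib.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helper: copy src into dst when the two tensors live in buffers
// owned by different backends and no direct device-to-device path is available.
static void copy_tensor_via_host(const struct ggml_tensor * src, struct ggml_tensor * dst) {
    const size_t nbytes = ggml_nbytes(src);

    void * staging = malloc(nbytes);
    ggml_backend_tensor_get(src, staging, 0, nbytes);   // device/RPC -> host
    ggml_backend_tensor_set(dst, staging, 0, nbytes);   // host -> device/RPC
    free(staging);
}

ggml_backend_tensor_copy() in ggml-backend already implements a fallback along these lines when the source and destination buffers cannot copy directly.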