llama-server --tensor-split #8735

Okoyl · 2024-07-28T11:08:24Z


Hey, about two months ago the server had the command-line argument --tensor-split, which allowed splitting the layer count across multiple GPUs.
I used it on a machine with 4x Tesla V100 16GB, with a value like "7,9,9,9", to make sure the first GPU always stays a bit free for the cache.
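
To illustrate what I mean by the proportions, here is a rough sketch of the arithmetic. It is only an illustration under my own assumptions, not llama.cpp's actual splitting code: the function name `split_layers` and the layer counts are made up, and the rounding rule (largest remainder) is just one plausible choice.

```python
def split_layers(n_layers, weights):
    """Divide n_layers among devices in proportion to weights.

    Hypothetical helper for illustration only; llama.cpp may round differently.
    """
    total = sum(weights)
    exact = [n_layers * w / total for w in weights]   # ideal fractional share
    counts = [int(x) for x in exact]                  # round each share down
    leftover = n_layers - sum(counts)                 # layers not yet assigned
    # Give the leftover layers to the devices with the largest fractional parts.
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Assumed layer counts, chosen only to show the proportions.
    print(split_layers(34, [7, 9, 9, 9]))
    print(split_layers(80, [7, 9, 9, 9]))
```

The point of a smaller first value is that device 0 receives fewer layers than the others, leaving some of its 16GB free for the cache.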

But I see this feature has been dropped, and now the cache allocation fails because it keeps targeting the first device:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2688.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache

Why was this feature dropped?


Replies: 0 comments
