Memory allocation increases until OOM - llama.cpp server #5993
Comments
Does it help if you add -dt?
I tried to add that argument, but it is not working: -dt is not a server argument. Do we need to add it somewhere else? Thanks for the fast response.
This happens when computing [...]. That said, the size of this buffer should only be about half of the compute buffer size.
Hi, I am running llama_cpp on a Linux VM with Python. Could you let me know where the file is so I can change the above code? I am having a hard time finding the file (examples/server/server.cpp).
Hi, I am running a lot of perf tests on the server.

Note: having 512 slots for testing 25-30 users is not appropriate; with a 131072 KV cache size, you get only 256 total tokens per slot (see the arithmetic sketch after this comment). Probably a lot of context shifting occurs, which is very slow in the current implementation. Also, you did not enable continuous batching.

@Neb2653 I advise you to test the following setup, a balanced approach between PP and TG for 1 GPU:

CUDA_VISIBLE_DEVICES=0 server --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--threads 1 \
--threads-batch 32 \
--batch-size 256 \
--ctx-size 32768 \
--parallel 32 \
--n-gpu-layers 33 \
--cont-batching \
--metrics \
--main-gpu 0 \
--log-format text \
--defrag-thold 0.8

If you are doing performance tests, I encourage you to scrape the /metrics endpoint. Also, for more users, [...]

@ggerganov I think it's better to scale the number of server replicas than to increase the KV cache size at the moment, taking into account the issue identified by @slaren, while we are all waiting for the new CUDA backend :)
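For reference, those per-slot figures are just the context size divided by the number of slots; a quick sketch of the arithmetic, using the numbers from the two commands in this thread:

    // Per-slot context budget = --ctx-size / --parallel.
    #include <cstdio>

    int main() {
        printf("original command:  %d tokens per slot\n", 131072 / 512); // 256
        printf("suggested command: %d tokens per slot\n", 32768 / 32);   // 1024
        return 0;
    }

Once a conversation grows past that per-slot budget, the slot has to context-shift, which is the slow path mentioned above.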
@ggerganov From my understanding, [...]
Thanks a lot, guys, for all the responses. We will test and get back to you.
Are you sure? https://github.com/ggerganov/llama.cpp/blob/48358b2e5b3983c41ba7e61a493e84d3901dc7b9/llama.cpp#L8770

Backend CPU is never null: [...]

If I run with [...]
With full GPU offload only the input layer is run on the CPU, which is just a get_rows operation that does not support multi-threading.
Thanks for the clarification. Is it the same when you have multiple inputs in the same batch, i.e. with n_parallel?
Yes, it only uses one thread regardless of the batch size. Using multiple threads actually hurts GPU performance with full offload due to the overhead of starting the threads.
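In other words, the effective CPU thread count collapses to 1 once everything is offloaded; a minimal sketch of that behaviour (an illustration only, not the actual llama.cpp code):

    // With the whole model offloaded, the only CPU-side work is the input get_rows,
    // so a single CPU thread is used; otherwise the configured thread count applies.
    #include <cstdio>

    static int effective_cpu_threads(int n_gpu_layers, int n_layer, int n_threads) {
        const bool fully_offloaded = n_gpu_layers >= n_layer + 1; // repeating layers + output layer
        return fully_offloaded ? 1 : n_threads;
    }

    int main() {
        printf("%d\n", effective_cpu_threads(33, 32, 32)); // 1: fully offloaded, as in the log below
        printf("%d\n", effective_cpu_threads(20, 32, 32)); // 32: partial offload
        return 0;
    }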
Then there is something I don't understand, because [...]
Hm, that's unexpected. As pointed out by @slaren, it should always end up using 1 thread. Do you observe the same behaviour with a LLaMA model (i.e. non-Mixtral)?

@slaren Should we try to multi-thread get_rows?
I actually already implemented multi-threaded get rows for pipeline parallelism (see here: 602a719), but I ended up not using it because it is still slower than just using 1 thread. However, it will use multiple threads if there are other operations in the graph that also support multi-threading.
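For anyone following along, get_rows is just a gather of embedding rows, so "multi-threaded get rows" means chunking that gather across threads; a generic sketch of the idea (not the code from that commit):

    // Generic sketch of a multi-threaded row gather: dst[i] = src[ids[i]] for
    // n_rows rows of row_size bytes, with rows interleaved across n_threads threads.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <thread>
    #include <vector>

    static void get_rows_mt(const uint8_t * src, const int32_t * ids, uint8_t * dst,
                            size_t n_rows, size_t row_size, int n_threads) {
        std::vector<std::thread> workers;
        for (int t = 0; t < n_threads; ++t) {
            workers.emplace_back([=] {
                for (size_t i = (size_t) t; i < n_rows; i += (size_t) n_threads) {
                    std::memcpy(dst + i * row_size, src + (size_t) ids[i] * row_size, row_size);
                }
            });
        }
        // starting and joining the workers is the overhead discussed above
        for (auto & w : workers) w.join();
    }

    int main() {
        const size_t row_size = 4096 * 2;            // one f16 embedding row for this model
        std::vector<uint8_t> src(32000 * row_size);  // token embedding matrix (zeroed for the demo)
        const int32_t ids[4] = {1, 5, 42, 7};        // token ids of the current batch
        std::vector<uint8_t> dst(4 * row_size);
        get_rows_mt(src.data(), ids, dst.data(), 4, row_size, 2);
        return 0;
    }

The per-batch row count is tiny (one row per token), so the thread start/join cost dominates, which is why a single thread wins in practice.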
Nice. Surprising that it does not help even for large batches. Seems the thread-start overhead is significantly higher than my expectations and how it compares to [...]. Btw, what was the reasoning for not offloading the input layer to the GPU? The comment mentions that it leads to little benefit - what are the downsides?
The downside of offloading the input layer is higher VRAM usage, as the token embedding weights can be quite big.
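For this model the numbers can be read off the log further down (n_vocab = 32000, n_embd = 4096); a rough sketch of the arithmetic, where the Q5_K figure lines up with the 85.94 MiB CUDA_Host buffer reported there:

    // Approximate size of the token embedding weights for this model.
    #include <cstdio>

    int main() {
        const double n_vocab = 32000.0;        // from the log
        const double n_embd  = 4096.0;         // from the log
        const double mib     = 1024.0 * 1024.0;
        printf("tok_embd, f16:  %.2f MiB\n", n_vocab * n_embd * 2.0 / mib);       // ~250 MiB
        printf("tok_embd, Q5_K: %.2f MiB\n", n_vocab * n_embd * 5.5 / 8.0 / mib); // ~85.94 MiB at 5.5 bits/weight
        return 0;
    }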
The chat is now stable with 32 parallel sessions. I think we can close this topic. Thanks, everyone, for the responses; they were very helpful.
Hi,
We need some advice from the community to be able to fix this issue.
We are running the server:
./server -t 32 --threads-http 32 --no-mmap -ngl 999 --batch-size 32 -m /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -c 131072 --parallel 512 --host 0.0.0.0 --port 8091
We have configured Huggingface chat-ui for user interaction.
If we run a stress test asking 20-30 users to write at the same time, we see that memory keeps accumulating, and once everyone stops it is not released; it just stays there. At some point we hit OOM because the memory is never freed.
My question is: how can we tune this so that memory usage goes back down when no one is writing in the chat, avoiding out-of-memory errors at the CUDA level?
Mar 11 11:45:24 srvmlwrkt01t systemd[1]: Started llama.cpp Service.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: found 1 CUDA devices:
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: Device 0: GRID A100D-80C, compute capability 8.0, VMM: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"build":0,"commit":"unknown","function":"main","level":"INFO","line":2796,"msg":"build info","tid":"139841044271104","timestamp":1710150324}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"function":"main","level":"INFO","line":2803,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139841044271104","timestamp":1710150324,"total_threads":8}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 0: general.architecture str = llama
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 1: general.name str = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 2: llama.context_length u32 = 32768
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 4: llama.block_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 9: llama.expert_count u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 13: general.file_type u32 = 17
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f32: 65 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f16: 32 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q8_0: 64 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q5_K: 833 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q6_K: 1 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: format = GGUF V3 (latest)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: arch = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: vocab type = SPM
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_vocab = 32000
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_merges = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ctx_train = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd = 4096
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head_kv = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_layer = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_rot = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_k = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_v = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_gqa = 4
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_k_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_v_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_eps = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ff = 14336
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert_used = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: pooling type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope scaling = linear
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_base_train = 1000000.0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_scale_train = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_yarn_orig_ctx = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope_finetuned = unknown
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model type = 7B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model ftype = Q5_K - Medium
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model params = 46.70 B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model size = 30.02 GiB (5.52 BPW)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: general.name = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: BOS token = 1 '<s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: EOS token = 2 '</s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: UNK token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: PAD token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: LF token = 13 '<0x0A>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_tensors: ggml ctx size = 0.76 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading 32 repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading non-repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloaded 33/33 layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA_Host buffer size = 85.94 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA0 buffer size = 30649.55 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: n_ctx = 131072
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_base = 1000000.0
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_scale = 1
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_kv_cache_init: CUDA0 KV buffer size = 16384.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host input buffer size = 17.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA0 compute buffer size = 531.25 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host compute buffer size = 0.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: graph splits (measure): 2
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: {"function":"initialize","level":"INFO","line":426,"msg":"initializing slots","n_slots":512}
Any advice will be appreciated!