Memory allocation increases until OOM - llama.cpp server #5993

Closed
Neb2653 opened this issue Mar 11, 2024 · 19 comments

Comments

@Neb2653

Neb2653 commented Mar 11, 2024

Hi,

We need some advice from the community to be able to fix this issue.

We are running the server :

./server -t 32 --threads-http 32 --no-mmap -ngl 999 --batch-size 32 -m /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -c 131072 --parallel 512 --host 0.0.0.0 --port 8091

nvidia-smi

We have configured Huggingface chat-ui for user interaction.
If we run a stress test with 20-30 users writing at the same time, we see memory usage accumulate, and once everyone stops the memory is not released; it stays allocated. Eventually we hit OOM because the memory is never freed.
My question is: how can we tune this so that memory usage decreases when no one is writing in the chat, and so that we avoid out-of-memory errors at the CUDA level?
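For reference, the 16 GiB KV cache reported in the startup log below follows directly from the requested context size; a rough sketch of that arithmetic, assuming the f16 K/V cache and the per-layer sizes shown in the log (32 layers, 1024 for both n_embd_k_gqa and n_embd_v_gqa):

# f16 KV cache ~ n_ctx * n_layer * (n_embd_k_gqa + n_embd_v_gqa) * 2 bytes
echo $(( 131072 * 32 * (1024 + 1024) * 2 / 1024 / 1024 ))   # => 16384 MiB, as reported in the log

This allocation is reserved up front when the context is created, so it does not shrink when users go idle.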

Mar 11 11:45:24 srvmlwrkt01t systemd[1]: Started llama.cpp Service.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: found 1 CUDA devices:
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: Device 0: GRID A100D-80C, compute capability 8.0, VMM: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"build":0,"commit":"unknown","function":"main","level":"INFO","line":2796,"msg":"build info","tid":"139841044271104","timestamp":1710150324}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"function":"main","level":"INFO","line":2803,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139841044271104","timestamp":1710150324,"total_threads":8}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 0: general.architecture str = llama
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 1: general.name str = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 2: llama.context_length u32 = 32768
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 4: llama.block_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 9: llama.expert_count u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 13: general.file_type u32 = 17
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f32: 65 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f16: 32 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q8_0: 64 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q5_K: 833 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q6_K: 1 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: format = GGUF V3 (latest)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: arch = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: vocab type = SPM
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_vocab = 32000
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_merges = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ctx_train = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd = 4096
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head_kv = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_layer = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_rot = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_k = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_v = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_gqa = 4
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_k_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_v_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_eps = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ff = 14336
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert_used = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: pooling type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope scaling = linear
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_base_train = 1000000.0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_scale_train = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_yarn_orig_ctx = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope_finetuned = unknown
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model type = 7B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model ftype = Q5_K - Medium
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model params = 46.70 B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model size = 30.02 GiB (5.52 BPW)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: general.name = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: BOS token = 1 '<s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: EOS token = 2 '</s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: UNK token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: PAD token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: LF token = 13 '<0x0A>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_tensors: ggml ctx size = 0.76 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading 32 repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading non-repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloaded 33/33 layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA_Host buffer size = 85.94 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA0 buffer size = 30649.55 MiB

Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: n_ctx = 131072
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_base = 1000000.0
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_scale = 1
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_kv_cache_init: CUDA0 KV buffer size = 16384.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host input buffer size = 17.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA0 compute buffer size = 531.25 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host compute buffer size = 0.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: graph splits (measure): 2
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: {"function":"initialize","level":"INFO","line":426,"msg":"initializing slots","n_s

Any advice will be appreciated!

@ggerganov
Member

Does it help if you add -dt 0.1 to the CLI args? The memory would still not be released, but I think it should prevent it from growing indefinitely.
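For reference, a minimal sketch of the original invocation with the suggested flag appended (assuming a build recent enough to accept it):

./server -t 32 --threads-http 32 --no-mmap -ngl 999 --batch-size 32 \
    -m /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf \
    -c 131072 --parallel 512 --host 0.0.0.0 --port 8091 -dt 0.1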

@Neb2653
Author

Neb2653 commented Mar 11, 2024

I tried to add that argument, but it is not working; -dt is not recognized as a server argument.

Do we need to add it somewhere else?

Thanks for the fast response.

@ggerganov
Member

The argument was added recently in #5941

You can either update to latest master or apply the patch manually - it's small: 52c76d5
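A rough sketch of picking up the change, assuming a Makefile-based CUDA build (build flags may differ on your setup):

git pull origin master && make clean && make LLAMA_CUBLAS=1 server
# or apply just the patch on the current checkout:
git cherry-pick 52c76d5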

@slaren
Member

slaren commented Mar 11, 2024

This happens when computing kqv, due to the buffer that is allocated to convert kq to FP16 in ggml_cuda_mul_mat_batched_cublas. Normally, the biggest allocation in the CUDA pool is the buffer used to convert the biggest weight to FP16, but with very large contexts the size of kq can far exceed the size of any weight. Once flash attention is implemented, this conversion will no longer be necessary and this should be fixed. To be clear, this is not a leak; the memory usage does not increase indefinitely. Reducing the context size should fix it.

That said, the size of this buffer should only be about half of the size of the compute buffer size.
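A rough sketch of that bound for this particular run, assuming kq has shape n_kv x n_batch x n_head and is converted at 2 bytes per element:

echo $(( 131072 * 32 * 32 * 2 / 1024 / 1024 ))   # => 256 MiB, roughly half of the 531.25 MiB compute buffer in the log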

@sykuann

sykuann commented Mar 12, 2024

Hi, I am running llama_cpp on a Linux VM with Python. Would you mind letting me know which file I need to change to apply the change above? I am having a hard time finding it (examples/server/server.cpp).

@phymbert
Collaborator

Hi,

I have been running a lot of perf tests on A100s with different models (llama70b, mixtral8x7b) for a while now, and I do not face this issue with a KV cache size of 32K.

Note: having 512 slots for testing 25-30 users is not appropriate; with a 131072 KV cache, you get only 256 total tokens per slot... Probably a lot of context shifting occurs, which is very slow in the current implementation. Also, you did not enable continuous batching.
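As a quick sanity check of the per-slot budget:

echo $(( 131072 / 512 ))   # => 256 tokens per slot in the current setup
echo $(( 32768 / 32 ))     # => 1024 tokens per slot with the setup suggested below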

@Neb2653 I advise you to test the following setup, a balanced approach between prompt processing (PP) and text generation (TG) for 1 A100 with mixtral8x7b:

CUDA_VISIBLE_DEVICES=0 server --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf  \
    --threads 1 \
    --threads-batch 32 \
    --batch-size 256 \
    --ctx-size 32768 \
    --parallel 32 \
    --n-gpu-layers 33 \
    --cont-batching  \
    --metrics \
    --main-gpu 0 \
    --log-format text \
    --defrag-thold 0.8

If you are doing performance tests, I encourage you to scrape /metrics with Prometheus and monitor the metrics exported by the server to tune the KV cache size and set the relevant number of slots based on deferred requests.
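A minimal sketch of checking that the endpoint is exposed, assuming the server was started with --metrics on the port used earlier in this thread:

curl -s http://localhost:8091/metrics | head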

Also, for more users, @ggerganov I think it's better to scale the number of server replicas than to increase the KV cache size at the moment, taking into account the issue identified by @slaren, while we are all waiting for the new CUDA backend :)

@phymbert
Collaborator

@ggerganov From my understanding, --threads is not used anymore with batching, and I think this parameter is misleading for users. If you are OK with it, I would be happy to remove its support in the server?

@Neb2653
Author

Neb2653 commented Mar 12, 2024

Thanks a lot guys for all the responses. We will test and get back to you.

@ggerganov
Member

--threads and --threads-batch are relevant when the model is not fully offloaded on the GPU. When it is fully offloaded, these parameters have no effect

@phymbert
Collaborator

--threads and --threads-batch are relevant when the model is not fully offloaded on the GPU. When it is fully offloaded, these parameters have no effect

Are you sure --threads-batch has no effect when the model is fully offloaded to VRAM?

https://github.com/ggerganov/llama.cpp/blob/48358b2e5b3983c41ba7e61a493e84d3901dc7b9/llama.cpp#L8770

Backend CPU is never null:
https://github.com/ggerganov/llama.cpp/blob/48358b2e5b3983c41ba7e61a493e84d3901dc7b9/llama.cpp#L12866

If I run with --threads-batch 1, the server is incredibly slow. @ggerganov am I missing something? I have -ngl 81 (the max for a 70b llama2).

@slaren
Member

slaren commented Mar 12, 2024

With full GPU offload only the input layer is run on the CPU, which is just a get_rows operation that does not support multi-threading.

@phymbert
Collaborator

With full GPU offload only the input layer is run on the CPU, which is just a get_rows operation that does not support multi-threading.

Thanks for the clarification. Is it the same when there are multiple inputs in the same batch, i.e. with n_parallel?

@slaren
Member

slaren commented Mar 12, 2024

Yes, it only uses one thread regardless of the batch size. Using multiple threads actually hurts GPU performance with full offload due to the overhead of starting the threads.

@phymbert
Collaborator

phymbert commented Mar 12, 2024

Then there is something I don't understand, because --threads-batch does have an impact on server performance. I will test again with 1 and get back to you with figures.

@ggerganov
Member

Then there is something I don't understand, because --threads-batch does have an impact on server performance. I will test again with 1 and get back to you with figures.

Hm, that's unexpected. As pointed out by @slaren, it should always end up using 1 thread. Do you observe the same behaviour with a LLaMA model (i.e. non-Mixtral)?

@slaren Should we try to multi-thread get_rows for n_rows > 1? Maybe it can lead to some gains for prompt processing, even with the overhead from starting threads

@slaren
Member

slaren commented Mar 12, 2024

I actually already implemented multi-threaded get_rows for pipeline parallelism (see here: 602a719), but I ended up not using it because it is still slower than just using 1 thread. However, it will use multiple threads if there are other operations in the graph that also support multi-threading.

@ggerganov
Member

Nice. Surprising that it does not help even for large batches. It seems the thread-start overhead is significantly higher than I expected, compared to the compute/memory requirements of get_rows.

Btw, what was the reasoning for not offloading the input layer to the GPU? The comment mentions that it leads to little benefit - what are the downsides?

@slaren
Member

slaren commented Mar 12, 2024

The downside of offloading the input layer is higher VRAM usage, as the token embeddings weights can be quite big.
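For the model in this issue, a rough sketch of that cost, assuming the token embedding tensor is stored as Q5_K (about 5.5 bits per weight):

# n_vocab * n_embd * 5.5 bits, converted to MiB
echo $(( 32000 * 4096 * 55 / 10 / 8 / 1024 / 1024 ))   # => ~85 MiB, lining up with the 85.94 MiB CUDA_Host buffer in the log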

@Neb2653
Author

Neb2653 commented Mar 15, 2024

The chat is stable now with 32 parallel sessions.

I think we can close this topic. Thanks everyone for the response, it was very helpful.
