Memory allocation increases until OOM - llama.cpp server #5993

Closed
Neb2653 opened this issue Mar 11, 2024 · 19 comments

Comments

@Neb2653

Neb2653 commented Mar 11, 2024

Hi,

We need some advice from the community to be able to fix this issue.

We are running the server :

./server -t 32 --threads-http 32 --no-mmap -ngl 999 --batch-size 32 -m /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -c 131072 --parallel 512 --host 0.0.0.0 --port 8091

nvidia-smi

We have configured Huggingface chat-ui for user interaction.
If we run a stress test with 20-30 users writing at the same time, we see memory usage accumulate, and once everyone stops the memory is not released; it stays allocated. Eventually we hit OOM because the memory is never freed.
My question is: how can we tune this so that memory usage decreases when no one is writing in the chat, and so that we avoid out-of-memory errors at the CUDA level?
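For reference, the 16 GiB KV cache reported in the startup log below follows directly from the requested context size; a rough sketch of that arithmetic, assuming the f16 K/V cache and the per-layer sizes shown in the log (32 layers, 1024 for both n_embd_k_gqa and n_embd_v_gqa):

# f16 KV cache ~ n_ctx * n_layer * (n_embd_k_gqa + n_embd_v_gqa) * 2 bytes
echo $(( 131072 * 32 * (1024 + 1024) * 2 / 1024 / 1024 ))   # => 16384 MiB, as reported in the log

This allocation is reserved up front when the context is created, so it does not shrink when users go idle.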

Mar 11 11:45:24 srvmlwrkt01t systemd[1]: Started llama.cpp Service.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: ggml_init_cublas: found 1 CUDA devices:
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: Device 0: GRID A100D-80C, compute capability 8.0, VMM: no
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"build":0,"commit":"unknown","function":"main","level":"INFO","line":2796,"msg":"build info","tid":"139841044271104","timestamp":1710150324}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: {"function":"main","level":"INFO","line":2803,"msg":"system info","n_threads":32,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139841044271104","timestamp":1710150324,"total_threads":8}
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 0: general.architecture str = llama
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 1: general.name str = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 2: llama.context_length u32 = 32768
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 4: llama.block_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 9: llama.expert_count u32 = 8
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 13: general.file_type u32 = 17
Mar 11 11:45:24 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f32: 65 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type f16: 32 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q8_0: 64 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q5_K: 833 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llama_model_loader: - type q6_K: 1 tensors
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: format = GGUF V3 (latest)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: arch = llama
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: vocab type = SPM
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_vocab = 32000
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_merges = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ctx_train = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd = 4096
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_head_kv = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_layer = 32
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_rot = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_k = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_head_v = 128
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_gqa = 4
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_k_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_embd_v_gqa = 1024
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_eps = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_ff = 14336
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert = 8
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_expert_used = 2
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: pooling type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope type = 0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope scaling = linear
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_base_train = 1000000.0
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: freq_scale_train = 1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: n_yarn_orig_ctx = 32768
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: rope_finetuned = unknown
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model type = 7B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model ftype = Q5_K - Medium
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model params = 46.70 B
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: model size = 30.02 GiB (5.52 BPW)
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: general.name = mistralai_mixtral-8x7b-instruct-v0.1
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: BOS token = 1 '<s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: EOS token = 2 '</s>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: UNK token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: PAD token = 0 '<unk>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_print_meta: LF token = 13 '<0x0A>'
Mar 11 11:45:25 srvmlwrkt01t server[1385897]: llm_load_tensors: ggml ctx size = 0.76 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading 32 repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloading non-repeating layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: offloaded 33/33 layers to GPU
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA_Host buffer size = 85.94 MiB
Mar 11 11:45:33 srvmlwrkt01t server[1385897]: llm_load_tensors: CUDA0 buffer size = 30649.55 MiB

Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: n_ctx = 131072
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_base = 1000000.0
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: freq_scale = 1
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_kv_cache_init: CUDA0 KV buffer size = 16384.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host input buffer size = 17.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA0 compute buffer size = 531.25 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: CUDA_Host compute buffer size = 0.50 MiB
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: llama_new_context_with_model: graph splits (measure): 2
Mar 11 11:45:40 srvmlwrkt01t server[1385897]: {"function":"initialize","level":"INFO","line":426,"msg":"initializing slots","n_s

Any advice will be appreciated!

@ggerganov
Member

Does it help if you add -dt 0.1 to the CLI args? The memory would still not be released, but I think it should prevent it from growing indefinitely.
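For reference, a minimal sketch of the original invocation with the suggested flag appended (assuming a build recent enough to accept it):

./server -t 32 --threads-http 32 --no-mmap -ngl 999 --batch-size 32 \
    -m /opt/models/mixtral_ollama/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf \
    -c 131072 --parallel 512 --host 0.0.0.0 --port 8091 -dt 0.1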

@Neb2653
Author

Neb2653 commented Mar 11, 2024

I tried to add that argument, but it is not working; -dt is not recognized as a server argument.

Do we need to add it somewhere else?

Thanks for the fast response.

@ggerganov
Member

The argument was added recently in #5941

You can either update to latest master or apply the patch manually - it's small: 52c76d5
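A rough sketch of picking up the change, assuming a Makefile-based CUDA build (build flags may differ on your setup):

git pull origin master && make clean && make LLAMA_CUBLAS=1 server
# or apply just the patch on the current checkout:
git cherry-pick 52c76d5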

@slaren
Member

slaren commented Mar 11, 2024

This happens when computing kqv, due to the buffer that is allocated to convert kq to FP16 in ggml_cuda_mul_mat_batched_cublas. Normally, the biggest allocation in the CUDA pool is the buffer used to convert the biggest weight to FP16, but with very large contexts the size of kq can far exceed the size of any weight. Once flash attention is implemented, this conversion will no longer be necessary and this should be fixed. To be clear, this is not a leak; the memory usage does not increase indefinitely. Reducing the context size should fix it.

That said, the size of this buffer should only be about half of the size of the compute buffer size.
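A rough sketch of that bound for this particular run, assuming kq has shape n_kv x n_batch x n_head and is converted at 2 bytes per element:

echo $(( 131072 * 32 * 32 * 2 / 1024 / 1024 ))   # => 256 MiB, roughly half of the 531.25 MiB compute buffer in the log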

@sykuann

sykuann commented Mar 12, 2024

Hi, I am running llama_cpp on a Linux VM with Python. Would you mind letting me know which file I need to change to apply the change above? I am having a hard time finding it (examples/server/server.cpp).

@phymbert
Collaborator

Hi,

I have been running a lot of perf tests on A100s with different models (llama70b, mixtral8x7b) for a while now, and I do not face this issue with a KV cache size of 32K.

Note: having 512 slots for testing 25-30 users is not appropriate; with a 131072 KV cache, you get only 256 total tokens per slot... Probably a lot of context shifting occurs, which is very slow in the current implementation. Also, you did not enable continuous batching.
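As a quick sanity check of the per-slot budget:

echo $(( 131072 / 512 ))   # => 256 tokens per slot in the current setup
echo $(( 32768 / 32 ))     # => 1024 tokens per slot with the setup suggested below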

@Neb2653 I advise you to test the following setup, a balanced approach between prompt processing (PP) and text generation (TG) for 1 A100 with mixtral8x7b:

CUDA_VISIBLE_DEVICES=0 server --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf  \
    --threads 1 \
    --threads-batch 32 \
    --batch-size 256 \
    --ctx-size 32768 \
    --parallel 32 \
    --n-gpu-layers 33 \
    --cont-batching  \
    --metrics \
    --main-gpu 0 \
    --log-format text \
    --defrag-thold 0.8

If you are doing performance tests, I encourage you to scrape /metrics with Prometheus and monitor the metrics exported by the server to tune the KV cache size and set the relevant number of slots based on deferred requests.
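A minimal sketch of checking that the endpoint is exposed, assuming the server was started with --metrics on the port used earlier in this thread:

curl -s http://localhost:8091/metrics | head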

Also, for more users, @ggerganov I think it's better to scale the number of server replicas than to increase the KV cache size at the moment, taking into account the issue identified by @slaren, while we are all waiting for the new CUDA backend :)

@phymbert
Collaborator

@ggerganov From my understanding, --threads is not used anymore with batching, and I think this parameter is misleading for users. If you are OK with it, I would be happy to remove its support in the server?

@Neb2653
Author

Neb2653 commented Mar 12, 2024

Thanks a lot guys for all the responses. We will test and get back to you.

@ggerganov
Member

--threads and --threads-batch are relevant when the model is not fully offloaded on the GPU. When it is fully offloaded, these parameters have no effect

@phymbert
Collaborator

--threads and --threads-batch are relevant when the model is not fully offloaded on the GPU. When it is fully offloaded, these parameters have no effect

Are you sure --threads-batch has no effect when the model is fully offloaded to VRAM?

https://github.com/ggerganov/llama.cpp/blob/48358b2e5b3983c41ba7e61a493e84d3901dc7b9/llama.cpp#L8770

Backend CPU is never null:
https://github.com/ggerganov/llama.cpp/blob/48358b2e5b3983c41ba7e61a493e84d3901dc7b9/llama.cpp#L12866

If I run with --threads-batch 1, the server is incredibly slow. @ggerganov am I missing something? I have -ngl 81 (the max for a 70b llama2).

@slaren
Member

slaren commented Mar 12, 2024

With full GPU offload only the input layer is run on the CPU, which is just a get_rows operation that does not support multi-threading.

@phymbert
Collaborator

With full GPU offload only the input layer is run on the CPU, which is just a get_rows operation that does not support multi-threading.

Thanks for the clarification. Is it the same when there are multiple inputs in the same batch, i.e. with n_parallel?

@slaren
Member

slaren commented Mar 12, 2024

Yes, it only uses one thread regardless of the batch size. Using multiple threads actually hurts GPU performance with full offload due to the overhead of starting the threads.

@phymbert
Collaborator

phymbert commented Mar 12, 2024

Then there is something I don't understand, because --threads-batch does have an impact on server performance. I will test again with 1 and get back to you with figures.

@ggerganov
Member

Then there is something I don't understand, because --threads-batch does have an impact on server performance. I will test again with 1 and get back to you with figures.

Hm, that's unexpected. As pointed out by @slaren, it should always end up using 1 thread. Do you observe the same behaviour with a LLaMA model (i.e. non-Mixtral)?

@slaren Should we try to multi-thread get_rows for n_rows > 1? Maybe it can lead to some gains for prompt processing, even with the overhead from starting threads

@slaren
Member

slaren commented Mar 12, 2024

I actually already implemented multi-threaded get_rows for pipeline parallelism (see here: 602a719), but I ended up not using it because it is still slower than just using 1 thread. However, it will use multiple threads if there are other operations in the graph that also support multi-threading.

@ggerganov
Member

Nice. Surprising that it does not help even for large batches. It seems the thread-start overhead is significantly higher than I expected, compared to the compute/memory requirements of get_rows.

Btw, what was the reasoning for not offloading the input layer to the GPU? The comment mentions that it leads to little benefit - what are the downsides?

@slaren
Member

slaren commented Mar 12, 2024

The downside of offloading the input layer is higher VRAM usage, as the token embeddings weights can be quite big.
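For the model in this issue, a rough sketch of that cost, assuming the token embedding tensor is stored as Q5_K (about 5.5 bits per weight):

# n_vocab * n_embd * 5.5 bits, converted to MiB
echo $(( 32000 * 4096 * 55 / 10 / 8 / 1024 / 1024 ))   # => ~85 MiB, lining up with the 85.94 MiB CUDA_Host buffer in the log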

@Neb2653
Author

Neb2653 commented Mar 15, 2024

The chat is stable now with 32 parallel sessions.

I think we can close this topic. Thanks everyone for the response, it was very helpful.
