MacOS: mlock/wired memory/cold start #9029
Replies: 3 comments
-
I've seen this behaviour as well on my Mac and I don't know how to fix it. Seems like some internal caching mechanism is at play.
-
Got it, thank you.
-
I couldn't figure it out either, but for single-user setups sending warmup/keepalive queries seems to work reasonably well.

keepalive_dst.mp4

Here we keep sending context + message as we type, even if the message hasn't changed, and (at least for a reasonably short context) we can get rid of the model loading lag and hide the context encoding latency.
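A rough sketch of what such a keepalive loop could look like against the server's `/completion` endpoint (the placeholder prompt, the 10-second interval, and the tiny `n_predict` are assumptions, not necessarily the exact setup in the video):

```sh
# Re-send the current context every few seconds so the model stays "hot".
# The prompt below is a placeholder for whatever context + message the UI
# currently holds; n_predict is kept tiny since we only want the warmup.
while true; do
  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "<current context + partial message>", "n_predict": 1, "cache_prompt": true}' \
    > /dev/null
  sleep 10
done
```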
-
Let's say we start llama server like this:
It is a large model (120GB+) but with very short context. Hardware is M2 Ultra 192GB.
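As a minimal sketch (the model path, context size, and port below are assumptions for illustration, not the exact flags used):

```sh
# Illustrative invocation only; model path, context size, and port are assumptions.
./llama-server \
    -m ./models/large-model-q8_0.gguf \
    -c 512 \
    --port 8080
```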
Now let's query it 3 times with the same prompt and no prompt cache:
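For example, something like this (the prompt text and `n_predict` value are placeholders; `cache_prompt: false` is what disables the server-side prompt cache):

```sh
# Send the same request three times in a row, with no prompt caching.
for i in 1 2 3; do
  time curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a short story.", "n_predict": 64, "cache_prompt": false}' \
    > /dev/null
done
```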
I see the following times:
These are the timings from the server log itself:
First prompt processing is much slower (cold start?). We don't use the cache; we can see that from `kv cache rm [p0, end)` where `p0 = 0`.

If I add a sleep between the calls like this:
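For instance (the 60-second pause is an arbitrary illustration, not the exact value used):

```sh
# Same request as above, but with a pause between calls.
for i in 1 2 3; do
  time curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a short story.", "n_predict": 64, "cache_prompt": false}' \
    > /dev/null
  sleep 60
done
```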
all three calls become equally slow:
Is there a way to avoid this cold start problem? Is there any way to keep the model always loaded (other than sending mock keepalive queries)?
With `--mlock` I see a difference in reported system metrics (memory stays wired; without mlock, wired goes down to 0), but there's no measurable difference in latency.
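For reference, one way to watch that metric while the server sits idle (the 5-second polling interval is arbitrary):

```sh
# Poll macOS wired-memory usage while the server is idle.
while true; do
  vm_stat | grep "Pages wired down"
  sleep 5
done
```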
I think `llama-cli` has the same behavior, it's just easier to reproduce with server queries.

Thank you!