MacOS: mlock/wired memory/cold start #9029
Replies: 3 comments
-
I've seen this behaviour as well on my Mac and I don't know how to fix it. Seems like some internal caching mechanism is at play.
-
Got it, thank you.
-
I couldn't figure it out either, but for single-user setups sending warmup/keepalive queries seems to work reasonably well.

keepalive_dst.mp4

Here we keep sending context + message as we type, even if the message hasn't changed, and (at least for a reasonably short context) we can get rid of the model loading lag and hide the context encoding latency.
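A rough sketch of what such a keepalive loop could look like against the server's `/completion` endpoint (the placeholder prompt, the 10-second interval, and the tiny `n_predict` are assumptions, not necessarily the exact setup in the video):

```sh
# Re-send the current context every few seconds so the model stays "hot".
# The prompt below is a placeholder for whatever context + message the UI
# currently holds; n_predict is kept tiny since we only want the warmup.
while true; do
  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "<current context + partial message>", "n_predict": 1, "cache_prompt": true}' \
    > /dev/null
  sleep 10
done
```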
-
Let's say we start llama server like this:
It is a large model (120GB+) but with very short context. Hardware is M2 Ultra 192GB.
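As a minimal sketch (the model path, context size, and port below are assumptions for illustration, not the exact flags used):

```sh
# Illustrative invocation only; model path, context size, and port are assumptions.
./llama-server \
    -m ./models/large-model-q8_0.gguf \
    -c 512 \
    --port 8080
```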
Now let's query it 3 times with the same prompt and no prompt cache:
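For example, something like this (the prompt text and `n_predict` value are placeholders; `cache_prompt: false` is what disables the server-side prompt cache):

```sh
# Send the same request three times in a row, with no prompt caching.
for i in 1 2 3; do
  time curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a short story.", "n_predict": 64, "cache_prompt": false}' \
    > /dev/null
done
```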
I see the following times:
These are the timings from the server log itself:
First prompt processing is much slower (cold start?). We don't use the cache; we can see that from `kv cache rm [p0, end)` where `p0 = 0`.

If I add a sleep between the calls like this:
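For instance (the 60-second pause is an arbitrary illustration, not the exact value used):

```sh
# Same request as above, but with a pause between calls.
for i in 1 2 3; do
  time curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a short story.", "n_predict": 64, "cache_prompt": false}' \
    > /dev/null
  sleep 60
done
```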
all three calls become equally slow:
Is there a way to avoid this cold start problem? Is there any way to keep the model always loaded (other than sending mock keepalive queries)?
With `--mlock` I see a difference in reported system metrics (memory stays wired; without mlock, wired goes down to 0), but there's no measurable difference in latency.
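For reference, one way to watch that metric while the server sits idle (the 5-second polling interval is arbitrary):

```sh
# Poll macOS wired-memory usage while the server is idle.
while true; do
  vm_stat | grep "Pages wired down"
  sleep 5
done
```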
I think `llama-cli` has the same behavior, it's just easier to reproduce with server queries.

Thank you!