Conversation
Now it can load the model, but it's not working yet; the remaining problems are in the math and tensor loading.
Btw, the best part about this is that the OS page cache is reused between llama.cpp and llama-rs!
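For illustration, here is a minimal sketch of mmap-backed loading, assuming the memmap2 crate and an example model path (both are assumptions, not the actual llama-rs code). Because the file is mapped rather than read into a private buffer, the kernel serves the bytes from the page cache, so a model already touched by llama.cpp (or a previous llama-rs run) loads without hitting disk again.

```rust
use std::fs::File;
use memmap2::Mmap; // assumption: memmap2 is used for the mapping

fn main() -> std::io::Result<()> {
    // Open the model file read-only; the path is just an example.
    let file = File::open("models/7B/ggml-model-q4_0.bin")?;

    // Map it into our address space. No copy happens here: reads are served
    // from the OS page cache, so pages already faulted in by another process
    // (llama.cpp, or an earlier llama-rs run) are simply reused.
    let mmap = unsafe { Mmap::map(&file)? };

    // Touching a byte faults the page in (or hits the cache if it's warm).
    println!("first byte: {:#x}, len: {}", mmap[0], mmap.len());
    Ok(())
}
```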
Another optimization: we don't have to allocate memory for the tensor data at all. The solution is simple; it's the same thing llama.cpp does (https://github.com/ggerganov/llama.cpp/blob/f76cb3a34d6a6b03afb96650e39495f201eac042/llama.cpp#L933).

EDIT: due to lazy allocation of OS pages by malloc (see here), this ends up not mattering. Still, I think it's better not to malloc when not needed. I've removed the unnecessary malloc.
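A rough sketch of the no-malloc idea (the `TensorView` type, the `tensor_from_mmap` helper, and the header offsets are made up for illustration; they are not the actual llama-rs structures): a tensor can borrow a slice of the mapping instead of copying into a freshly allocated buffer.

```rust
use memmap2::Mmap;

/// Hypothetical tensor view: instead of owning a freshly allocated buffer,
/// it borrows its data directly out of the mmap'd model file.
struct TensorView<'a> {
    name: String,
    data: &'a [u8],
}

/// Sketch: given a byte offset and length recorded in the model header,
/// build a tensor that points into the mapping. No allocation, no copy;
/// the bytes are paged in lazily on first access.
fn tensor_from_mmap<'a>(mmap: &'a Mmap, name: &str, offset: usize, len: usize) -> TensorView<'a> {
    TensorView {
        name: name.to_string(),
        data: &mmap[offset..offset + len],
    }
}
```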
Can confirm this is working.

The cost of a page fault is not paid until first access, so 38ms is definitely not right.
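A small timing sketch of that point, again assuming memmap2 and an example path: the map call itself returns almost immediately, and the real cost only shows up once the pages are actually touched.

```rust
use std::{fs::File, time::Instant};
use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("models/7B/ggml-model-q4_0.bin")?; // example path
    let t0 = Instant::now();
    let mmap = unsafe { Mmap::map(&file)? };
    println!("mmap itself: {:?}", t0.elapsed()); // typically microseconds

    // The cost appears here: touching every page forces the kernel to fault
    // it in (from disk on a cold start, from the page cache when warm).
    let t1 = Instant::now();
    let mut checksum = 0u64;
    for chunk in mmap.chunks(4096) {
        checksum = checksum.wrapping_add(chunk[0] as u64);
    }
    println!("touch all pages: {:?} (checksum {checksum})", t1.elapsed());
    Ok(())
}
```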
I suppose. But verifying that would require measuring the inference time. At the least, the user can interact right away, which is a plus.
We should add tests to show proper loading for all these formats. We need some simple models; we can generate them with a simple script (a rough sketch follows below). I'd also like to implement a way to dump the loaded model into a file of a chosen format.
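A minimal sketch of what such a generation script could look like; the on-disk layout below (name length, name, element count, f32 payload) is invented purely for illustration and is not the real GGML format.

```rust
use std::{fs::File, io::{self, Write}};

/// Write a tiny fake model file for tests. The layout here is made up;
/// the real script would emit whatever format the loader under test expects.
fn write_tiny_model(path: &str) -> io::Result<()> {
    let mut f = File::create(path)?;
    for (name, elems) in [("tok_embeddings.weight", 8usize), ("output.weight", 8)] {
        // name length + name bytes
        f.write_all(&(name.len() as u32).to_le_bytes())?;
        f.write_all(name.as_bytes())?;
        // element count + f32 payload
        f.write_all(&(elems as u32).to_le_bytes())?;
        for i in 0..elems {
            f.write_all(&(i as f32).to_le_bytes())?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    write_tiny_model("tiny-test-model.bin")
}
```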
I would suggest making this a PR to iacore's work, so that we can merge #125 and have this included. |
I think this should be merged first. #125 has less user-facing features than this. |
I've subsumed this into #125 - thanks for the PR, awesome to see |
Fixes the issues in #125
Improvements:
- `--features="mmap"`: warm start: 7ms, cold start: 38ms
- `--features="mmap"`: warm start: 9ms, cold start: 33ms

So we get a 250X-500X speedup! Higher than the advertised 10-100x :)
@iacore