
feat: mmapped ggjt loader #129

Closed
wants to merge 18 commits

Conversation

@jon-chuang (Contributor) commented Apr 12, 2023

Fixes the issues in #125

Improvements:

  • Loading 7B Vicuna (q4):
    • default: warm start: 1785ms, cold start: 2618ms
    • --features="mmap": warm start: 7ms, cold start: 38ms
  • Loading 13B Vicuna (q4):
    • default: warm start: 4833ms, cold start: 5905ms
    • --features="mmap": warm start: 9ms, cold start: 33ms

So we get a 250x-500x speedup! Higher than the advertised 10-100x :)
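For reference, a minimal sketch of what the mmap-backed loading path looks like, assuming the memmap2 crate (the function below is illustrative, not this PR's exact loader code):

```rust
use std::fs::File;

use memmap2::Mmap;

// Hypothetical helper: maps a GGJT file into memory instead of reading
// it into a freshly allocated buffer.
fn load_ggjt_mmapped(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: the mapping is only valid as long as the underlying file
    // is not truncated or modified by another process.
    let mmap = unsafe { Mmap::map(&file)? };
    // Tensor bytes can now be borrowed directly out of `mmap[offset..]`,
    // so "loading" reduces to setting up pointers, which is why the
    // wall-clock time collapses to single-digit milliseconds.
    Ok(mmap)
}
```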

@iacore

@jon-chuang changed the title from "mmapped loader" to "feat: mmapped ggjt loader" on Apr 12, 2023
@jon-chuang (Contributor, Author)

Btw, the best part about this is that the OS page cache is reused between llama.cpp and llama-rs!

@jon-chuang (Contributor, Author) commented Apr 12, 2023

Another optimization: we don't have to allocate memory in new_tensor_{}d for the mmap case.

The solution is simple. As per llama.cpp (https://github.com/ggerganov/llama.cpp/blob/f76cb3a34d6a6b03afb96650e39495f201eac042/llama.cpp#L933), set ctx.no_alloc to true.

EDIT: because the OS allocates pages lazily on malloc (see here), this ends up not mattering for performance. Still, I think it's better not to malloc when it isn't needed.

I've removed the unnecessary malloc.
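A hedged sketch of the resulting setup, using raw bindgen-style ggml bindings (the crate and binding names are assumptions, not llama-rs's exact wrapper API):

```rust
use std::ptr;

// Creates a ggml context that only tracks tensor metadata, without
// allocating a backing buffer for tensor data.
unsafe fn init_no_alloc_ctx(mem_size: usize) -> *mut ggml_sys::ggml_context {
    let params = ggml_sys::ggml_init_params {
        mem_size,
        mem_buffer: ptr::null_mut(),
        // With no_alloc = true, ggml_new_tensor_*d only writes tensor
        // headers; each tensor's data pointer is patched afterwards to
        // point into the mmapped file instead of malloc'd memory.
        no_alloc: true,
    };
    ggml_sys::ggml_init(params)
}
```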

@iacore (Contributor) commented Apr 12, 2023

Can confirm this is working.

The cost of a page fault is not paid until first access, so 38ms is definitely not the full cost.

@iacore (Contributor) commented Apr 12, 2023

The llama-loader crate is still WIP (it can only load GGJT and isn't used by llama-rs). I think it's better for you to rebase this onto cc846ae.

@jon-chuang (Contributor, Author)

> The cost of a page fault is not paid until first access, so 38ms is definitely not the full cost.

I suppose. But capturing that cost requires measuring the inference time as well.

At the least, the user can interact right away, which is a plus.
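To make that concrete, a hedged sketch of the measurement iacore is suggesting; `load_model` and `infer_first_token` are hypothetical stand-ins for llama-rs's actual loading and inference entry points:

```rust
use std::time::Instant;

// `load_model` and `infer_first_token` are hypothetical stand-ins, not
// llama-rs's real API.
fn measure(path: &str, prompt: &str) {
    let t0 = Instant::now();
    let model = load_model(path);
    println!("load: {}ms", t0.elapsed().as_millis());

    let t1 = Instant::now();
    let _token = infer_first_token(&model, prompt);
    // With mmap, page-fault cost moves out of load time and into the
    // first forward pass, so both numbers are needed for a fair comparison.
    println!("first token: {}ms", t1.elapsed().as_millis());
}
```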

@jon-chuang (Contributor, Author) commented Apr 13, 2023

> can only load GGJT, and isn't used by llama-rs

We should add tests to show proper loading for all these formats.

We need some simple test models; we can generate them with a simple script. I'd also like to implement a way to dump a loaded model into a file of a chosen format.
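A hedged sketch of what such a round-trip test could look like; `load_model`, `dump_model`, and `Format` are hypothetical stand-ins for whatever llama-loader ends up exposing:

```rust
// `load_model`, `dump_model`, and `Format` are hypothetical stand-ins,
// not llama-loader's real API.
#[test]
fn ggjt_round_trip() {
    let path = "tests/data/tiny-ggjt.bin";
    let original = std::fs::read(path).unwrap();

    let model = load_model(path).unwrap();
    let mut dumped = Vec::new();
    dump_model(&model, &mut dumped, Format::Ggjt).unwrap();

    // If loading and dumping are inverses, the bytes should round-trip.
    assert_eq!(original, dumped);
}
```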

@philpax (Collaborator) commented Apr 13, 2023

I would suggest making this a PR against iacore's branch, so that we can merge #125 with this included.

@iacore (Contributor) commented Apr 19, 2023

I think this should be merged first. #125 has fewer user-facing features than this.

@philpax (Collaborator) commented Apr 19, 2023

I've subsumed this into #125 - thanks for the PR, awesome to see mmap working 🚀

@philpax closed this on Apr 19, 2023
@philpax mentioned this pull request on Apr 20, 2023