Conversation
Now it can load the model, but it's not working yet; the remaining problems are in the math and tensor loading.
Btw, the best part about this is that the OS page cache is reused between llama.cpp and llama-rs!
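For illustration, here is a minimal sketch of mmap-backed loading, assuming the memmap2 crate and an example model path (both are assumptions, not the actual llama-rs code). Because the file is mapped rather than read into a private buffer, the kernel serves the bytes from the page cache, so a model already touched by llama.cpp (or a previous llama-rs run) loads without hitting disk again.

```rust
use std::fs::File;
use memmap2::Mmap; // assumption: memmap2 is used for the mapping

fn main() -> std::io::Result<()> {
    // Open the model file read-only; the path is just an example.
    let file = File::open("models/7B/ggml-model-q4_0.bin")?;

    // Map it into our address space. No copy happens here: reads are served
    // from the OS page cache, so pages already faulted in by another process
    // (llama.cpp, or an earlier llama-rs run) are simply reused.
    let mmap = unsafe { Mmap::map(&file)? };

    // Touching a byte faults the page in (or hits the cache if it's warm).
    println!("first byte: {:#x}, len: {}", mmap[0], mmap.len());
    Ok(())
}
```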
Another optimization: we don't have to allocate memory for the tensor data at all. The solution is simple; it's the same thing llama.cpp does (https://github.com/ggerganov/llama.cpp/blob/f76cb3a34d6a6b03afb96650e39495f201eac042/llama.cpp#L933).

EDIT: due to lazy allocation of OS pages by malloc (see here), this ends up not mattering. Still, I think it's better not to malloc when not needed. I've removed the unnecessary malloc.
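A rough sketch of the no-malloc idea (the `TensorView` type, the `tensor_from_mmap` helper, and the header offsets are made up for illustration; they are not the actual llama-rs structures): a tensor can borrow a slice of the mapping instead of copying into a freshly allocated buffer.

```rust
use memmap2::Mmap;

/// Hypothetical tensor view: instead of owning a freshly allocated buffer,
/// it borrows its data directly out of the mmap'd model file.
struct TensorView<'a> {
    name: String,
    data: &'a [u8],
}

/// Sketch: given a byte offset and length recorded in the model header,
/// build a tensor that points into the mapping. No allocation, no copy;
/// the bytes are paged in lazily on first access.
fn tensor_from_mmap<'a>(mmap: &'a Mmap, name: &str, offset: usize, len: usize) -> TensorView<'a> {
    TensorView {
        name: name.to_string(),
        data: &mmap[offset..offset + len],
    }
}
```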
Can confirm this is working.

The cost of a page fault is not paid until first access, so 38ms is definitely not right.
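A small timing sketch of that point, again assuming memmap2 and an example path: the map call itself returns almost immediately, and the real cost only shows up once the pages are actually touched.

```rust
use std::{fs::File, time::Instant};
use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("models/7B/ggml-model-q4_0.bin")?; // example path
    let t0 = Instant::now();
    let mmap = unsafe { Mmap::map(&file)? };
    println!("mmap itself: {:?}", t0.elapsed()); // typically microseconds

    // The cost appears here: touching every page forces the kernel to fault
    // it in (from disk on a cold start, from the page cache when warm).
    let t1 = Instant::now();
    let mut checksum = 0u64;
    for chunk in mmap.chunks(4096) {
        checksum = checksum.wrapping_add(chunk[0] as u64);
    }
    println!("touch all pages: {:?} (checksum {checksum})", t1.elapsed());
    Ok(())
}
```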
I suppose. But verifying that would require measuring the inference time. At the least, the user can interact right away, which is a plus.
We should add tests to show proper loading for all these formats. We need some simple models; we can generate them with a simple script (a rough sketch follows below). I'd also like to implement a way to dump the loaded model into a file of a chosen format.
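A minimal sketch of what such a generation script could look like; the on-disk layout below (name length, name, element count, f32 payload) is invented purely for illustration and is not the real GGML format.

```rust
use std::{fs::File, io::{self, Write}};

/// Write a tiny fake model file for tests. The layout here is made up;
/// the real script would emit whatever format the loader under test expects.
fn write_tiny_model(path: &str) -> io::Result<()> {
    let mut f = File::create(path)?;
    for (name, elems) in [("tok_embeddings.weight", 8usize), ("output.weight", 8)] {
        // name length + name bytes
        f.write_all(&(name.len() as u32).to_le_bytes())?;
        f.write_all(name.as_bytes())?;
        // element count + f32 payload
        f.write_all(&(elems as u32).to_le_bytes())?;
        for i in 0..elems {
            f.write_all(&(i as f32).to_le_bytes())?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    write_tiny_model("tiny-test-model.bin")
}
```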
I would suggest making this a PR to iacore's work, so that we can merge #125 and have this included. |
I think this should be merged first. #125 has less user-facing features than this. |
I've subsumed this into #125 - thanks for the PR, awesome to see |
Fixes the issues in #125
Improvements:
- `--features="mmap"`: warm start: 7ms, cold start: 38ms
- `--features="mmap"`: warm start: 9ms, cold start: 33ms

So we get a 250X-500X speedup! Higher than the advertised 10-100x :)
@iacore