
Make loading weights 10-100x faster #613

Merged
merged 9 commits into from
Mar 30, 2023

Conversation

jart
Contributor

@jart jart commented Mar 29, 2023

This is a breaking change that's going to give us three benefits:

  1. Your inference commands should load 100x faster
  2. You may be able to safely load models 2x larger
  3. You can run many concurrent inference processes

This was accomplished by changing the file format so we can mmap()
weights directly into memory without having to read() or copy them,
thereby ensuring, first, that the kernel can make its file cache
pages directly accessible to our inference processes; and second,
that the file cache pages are much less likely to get evicted (which
would force loads to hit disk), because they're no longer competing
with memory pages that were needlessly created by gigabytes of
standard I/O.
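
A minimal sketch of the idea (illustrative only, not the PR's actual
loader; llama_mmap_file and its error handling here are assumptions):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical sketch: map a weights file read-only so the kernel's page
// cache backs the tensors directly, with no read() copies. MAP_SHARED
// lets many concurrent inference processes share the same physical pages.
static void *llama_mmap_file(const char *fname, size_t *out_len) {
    int fd = open(fname, O_RDONLY);
    if (fd == -1) return NULL;
    off_t len = lseek(fd, 0, SEEK_END);
    void *addr = mmap(NULL, (size_t)len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after closing the descriptor
    if (addr == MAP_FAILED) return NULL;
    *out_len = (size_t)len;
    return addr;
}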

The new file format supports single-file models like LLaMA 7B, and
it also supports multi-file models like LLaMA 13B. Our Python tool
now merges the foo.1, foo.2, etc. files back into a single file so
that the C++ code which maps it doesn't need to reshape data every
time. That's made llama.cpp so much simpler. Much of its load code
has now been deleted.

Furthermore, this change ensures that tensors are aligned properly
on a 32-byte boundary. That opens the door to seeing if we can get
additional performance gains on some microprocessors, by using ops
that require memory alignment.
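
For illustration, the amount of zero padding needed to land tensor data
on a 32-byte boundary can be computed like this (a hedged sketch;
padding_for_alignment is a hypothetical helper, not the converter's code):

#include <stdint.h>

// Hypothetical helper: bytes of zero padding to write after a tensor
// header so the tensor data that follows starts on a 32-byte boundary.
static uint64_t padding_for_alignment(uint64_t file_offset, uint64_t align) {
    return (align - (file_offset % align)) % align;  // 0 if already aligned
}

// e.g. padding_for_alignment(100, 32) == 28, so data would begin at offset 128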

Lastly, note that both POSIX and Windows platforms are supported.

The issue this PR solves is #91

This PR was written in collaboration with @slaren. This PR is also rebased on
PR #586 so please do not squash merge! Use either merge or rebase.

@jart jart added the performance (speed related topics) and breaking change (changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) labels Mar 29, 2023
@jart jart mentioned this pull request Mar 30, 2023
@luminalle

Should the other converters also be rewritten to handle this new format?

@jart
Contributor Author

jart commented Mar 30, 2023

Yes indeed. I just fixed the quantize program. Now I'm hunting down all the tests.

@jart
Contributor Author

jart commented Mar 30, 2023

All tests look green except for a CMake test. For example: https://github.com/ggerganov/llama.cpp/actions/runs/4559537462/jobs/8043597142?pr=613 I'm stumped on this error. I can't figure out where the file models/ggml-vocab.bin comes from. Does anyone know? Could it be a stale cache?

@FNsi
Contributor

FNsi commented Mar 30, 2023

> All tests look green except for a CMake test. For example: https://github.com/ggerganov/llama.cpp/actions/runs/4559537462/jobs/8043597142?pr=613 I'm stumped on this error. I can't figure out where the file models/ggml-vocab.bin comes from. Does anyone know? Could it be a stale cache?

#355 mentioned "Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)"

@@ -20,7 +20,7 @@
#endif

#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
Contributor

@bakkot bakkot Mar 30, 2023

Nit: why change the magic rather than the version? I assumed the plan was to keep the magic constant forever. If you bump the version instead, old executables will recognize new model files and give a more useful error message. And it's nice to distinguish between "this is definitely a model file for this project, but it's the wrong version" vs "this is some random junk we don't know anything about".

(This PR is a very neat bit of engineering; please don't let my nitpick distract from that.)
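
A minimal sketch of the distinction described above (check_header is a
hypothetical helper, not code from the PR): keeping the magic stable and
bumping LLAMA_FILE_VERSION lets an old binary tell the two cases apart.

#include <stdint.h>
#include <stdio.h>

#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC   0x67676d66 // 'ggmf' in hex

// Hypothetical check: stable magic + bumped version gives a clearer error.
static int check_header(const char *fname, uint32_t magic, uint32_t version) {
    if (magic != LLAMA_FILE_MAGIC) {
        fprintf(stderr, "%s: not a ggml model file\n", fname);
        return 0;
    }
    if (version != LLAMA_FILE_VERSION) {
        fprintf(stderr, "%s: unsupported file version %u (expected %u)\n",
                fname, version, LLAMA_FILE_VERSION);
        return 0;
    }
    return 1;
}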

Collaborator

not a nitpick but a real change request :)

Collaborator

(nvm)

@ggerganov
Owner

ggerganov commented Mar 30, 2023

@jart
The models/ggml-vocab.bin is generated by convert-pth-to-ggml.py by providing an extra arg.

I had the expectation that mmap support would be much more intrusive, but in fact it turned out to be very compact. llama.cpp is much simpler now. Good stuff

Regarding the version comment - yes, the plan was to bump versions and not the magic. But I'm ok to change the magic to commemorate the significance of this update. In fact, maybe we can make this a thing and everybody who makes a significant contribution to the project will get their initials appended to the version. What do you think? 😄

Let me play with this tonight before merging. We have to take special care that all the other ggml model files floating around (Alpaca, GPT4All, Chinese LLaMA, etc.) have a nice way to convert to this new format, and update the instructions in the README.

Also, maybe some synchronisation with #545 would be needed

@jart
Contributor Author

jart commented Mar 30, 2023

File updated. A lot more tests are green now. No idea what's up with the sanitizer.

I thought so too! I too was pleasantly surprised by how well it worked out. Glad we took a few weeks to think.

I'm honored to hear you say that. I can round up the magic to 64 bytes if you like, so there's room to hand out kudos without breaking backwards compatibility in the future. Since my initials also act as a stamp of approval, I'm going to send a follow-up change after this that'll harden the loading code, so that folks will be able to trade model files in this format on HuggingFace with maximum safety and confidence.

#545 is an ambitious unification. I've done my best to comment my changes to make the merge less painful for the author. I've sought to update the other scripts too, but don't know how to run them. One thing you could also consider with this project is having a contrib/ folder, where folks can merge as much of their own stuff as they want, under the expectation that the ones who need it are the ones who maintain it.

int fd = open(fname, O_RDONLY);
if (fd == -1) return 0;
int64_t length = lseek(fd, 0, SEEK_END);  // file size in bytes
void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);  // map the whole file read-only
Collaborator

  1. Is it safer to use mmap64 for 4GB+ files?
  2. It seems mmap, mmap64 and MapViewOfFile support mapping from a given offset. Is it possible to map from header_len (as the offset)? If we can do this, there's no need to align the model file, right?

Contributor Author

  1. The right thing to do on 32-bit platforms is to have your build system define -D_FILE_OFFSET_BITS=64, which will cause your system header files to automatically #define mmap mmap64.
  2. File offsets passed to mmap() need to be page size aligned, so I don't think so.
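
A minimal sketch of that constraint (map_from_offset is a hypothetical
helper, not code from the PR): mapping "from header_len" means rounding
the offset down to a page boundary and adjusting the pointer yourself,
which is why aligning the tensor data in the file is the simpler option.

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical illustration: mmap() offsets must be multiples of the page
// size, so an arbitrary header_len has to be rounded down and compensated.
static void *map_from_offset(int fd, int64_t header_len, int64_t file_len) {
    long page = sysconf(_SC_PAGESIZE);
    int64_t aligned = header_len - (header_len % page);   // round down
    int64_t delta   = header_len - aligned;               // bytes to skip
    void *base = mmap(NULL, (size_t)(file_len - aligned), PROT_READ,
                      MAP_SHARED, fd, (off_t)aligned);
    if (base == MAP_FAILED) return NULL;
    return (char *)base + delta;   // first byte past the header
}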

@pgoodman pgoodman Mar 31, 2023

@jart Is it possible to ensure the file size is a multiple of the hugepage size (e.g. using ftruncate), to benefit from fewer TLB lookups when the model data is accessed? (Corresponding mmap hints or other system-specific APIs, e.g. as needed for macOS, might need to be used.)

Contributor Author

It doesn't matter with mmap() if the file length isn't page size aligned, even with smaller pages. You should be good to go if you modify the mmap() code in llama.cpp by hand and actually manage to get huge pages to work without nuking your machine :-)

TIL!

jart added a commit to jart/llama.cpp that referenced this pull request Mar 30, 2023
If you deleted your old Meta LLaMA .pth files, then the
migrate-ggml-2023-03-30-pr613.py script will allow you to convert your
old ggml files into the new mmap()'able format.

See ggerganov#613
@jart
Contributor Author

jart commented Mar 30, 2023

@ggerganov This change now includes a migration tool named migrate-ggml-2023-03-30-pr613.py. This will ensure that users of the old GGML file format who've deleted the original .pth files will be able to convert their ggml+ggmf files to the new ggml+ggjt format. Please take a look.

@x02Sylvie

Having an issue migrating the alpaca model ggml-alpaca-13b-q4.bin: the python script seems to think the model has two n_parts rather than one. Would adding an --n_parts argument to the conversion script, to manually specify --n_parts 1 just like when running alpaca models on llama.cpp, resolve the issue?

@jart
Contributor Author

jart commented Mar 30, 2023

@x02Sylvie I don't have access to the Alpaca model. Could you send a pull request fixing that after this gets merged?

@x02Sylvie

x02Sylvie commented Mar 30, 2023

I don't really know Python, so I'd rather leave the pull request to someone smarter than me.

I did however manage to get the alpaca 13B model converted by manually setting n_parts to 1 in the .py conversion script. I'm unsure if it's the proper place to set n_parts though, changing

def get_n_parts(dim):
    mappings = {4096: 1, 5120: 2, 6656: 4, 8192: 8}

    n_parts = mappings.get(dim)

    if n_parts is None:
        print(f"Invalid dim: {dim}")
        sys.exit(1)
    print(f"n_parts = {n_parts}\n")
    return n_parts

to

def get_n_parts(dim):
    mappings = {4096: 1, 5120: 2, 6656: 4, 8192: 8}

    n_parts = 1

    if n_parts is None:
        print(f"Invalid dim: {dim}")
        sys.exit(1)
    print(f"n_parts = {n_parts}\n")
    return n_parts

The model does work after conversion, however.

@gaceladri

Hello,

I cannot load the gpt4all model after converting it to the new ggml format using your script:
python3 convert-gpt4all-to-ggml.py models/gpt4all/gpt4all-lora-quantized.bin ./models/tokenizer.model

I have opened a new issue probably related to this: #655 (comment)

@gaceladri

I could run it with the previous version https://github.com/ggerganov/llama.cpp/tree/master-ed3c680

> Hello,
>
> I cannot load the gpt4all model after converting it to the new ggml format using your script: python3 convert-gpt4all-to-ggml.py models/gpt4all/gpt4all-lora-quantized.bin ./models/tokenizer.model
>
> I have opened a new issue probably related to this: #655 (comment)

@rabidcopy
Contributor

rabidcopy commented Mar 31, 2023

> Hello,
>
> I cannot load the gpt4all model after converting it to the new ggml format using your script: python3 convert-gpt4all-to-ggml.py models/gpt4all/gpt4all-lora-quantized.bin ./models/tokenizer.model
>
> I have opened a new issue probably related to this: #655 (comment)

You also need to run the resulting file through migrate-ggml-2023-03-30-pr613.py.

gpt4all weights -> convert-gpt4all-to-ggml.py -> converted gpt4all weights -> migrate-ggml-2023-03-30-pr613.py -> gpt4all weights compatible with the latest version of llama.cpp

@gaceladri

It worked. Thank you for your fast response!

Nuked88 pushed a commit to Nuked88/llama.http that referenced this pull request Mar 31, 2023
If you deleted your old Meta LLaMA .pth files, then the
migrate-ggml-2023-03-30-pr613.py script will allow you to convert your
old ggml files into the new mmap()'able format.

See ggerganov#613
@asklar

asklar commented Apr 1, 2023

Great work @jart and @slaren! <3

ShoufaChen added a commit to ShoufaChen/langchain-patch that referenced this pull request Apr 4, 2023
As noted in https://github.com/ggerganov/llama.cpp/blob/master/migrate-ggml-2023-03-30-pr613.py,

The `llama.cpp` authors made a breaking change to the file format on 2023-03-30 in:
ggerganov/llama.cpp#613

Therefore, we additionally need to use `migrate-ggml-2023-03-30-pr613.py` to convert the llama model.
hwchase17 pushed a commit to langchain-ai/langchain that referenced this pull request Apr 6, 2023
As noted in
https://github.com/ggerganov/llama.cpp/blob/master/migrate-ggml-2023-03-30-pr613.py,

The `llama.cpp` authors made a breaking change to the file format on
2023-03-30 in: ggerganov/llama.cpp#613

Therefore, we additionally need to use `migrate-ggml-2023-03-30-pr613.py` to
convert the llama model.