
Making weights loading faster #216

Closed · wants to merge 3 commits

Conversation

oKatanaaa

Tried to address slow weight loading. 7B is okay, but 13B is really slow (several minutes), which makes it hard to experiment/prototype with larger models.

Replaced std::ifstream with C-style file reading using fopen. Got a considerable boost in loading performance: 3x to 10x faster on my machine (measurements were somewhat inconsistent, but it is definitely a lot faster than before).

I made sure the weights are correctly loaded: fixed the seed and gave the same prompts - the model gives the same output, so everything is good.

Also increased the buffer size from 1024*1024 to 128*1024*1024 (see line 102), which gave a slight boost as well. Though I am not sure whether it is optimal for edge devices like the Raspberry Pi (if that's of any concern).

@gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels on Mar 17, 2023

@Kangmo Kangmo left a comment


looks good to me

@sw
Contributor

sw commented Mar 18, 2023

Is this faster than using mmap (see #91 / #150)?

@oKatanaaa
Author

> Is this faster than using mmap (see #91 / #150)?

Just tested the #150 implementation; it was a lot slower than plain fopen. I believe #150 should've been faster on subsequent runs, but I didn't see any perceptible speed-up, maybe just a little.

@Piezoid
Contributor

Piezoid commented Mar 18, 2023

> Is this faster than using mmap (see #91 / #150)?

Implementing #91 correctly lets us avoid creating any redundant copies of the model. This is crucial because if we duplicate it when there's only enough memory to hold one instance, we run the risk of evicting it from the operating system's file system cache.

@apaz-cli
Contributor

@oKatanaaa switching between std::ifstream and FILE* should make no measurable difference. They are both tunable, do conceptually the exact same thing, and support (almost) exactly the same set of operations. The main bottleneck is the time spent inside system calls reading from disk, and they make those calls in exactly or almost exactly the same way. The first implementations literally called fopen() directly; I forget if they still do.

Your testing methodology is probably at fault. Files that are opened once are likely to be opened again soon after, and OSes like to cache the contents of files (and shared memory pages too) so that subsequent reads are faster. On Linux you can flush the file and page caches with:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null

You may have noticed that with the vanilla build on master loading the model sometimes skips ahead by quite a few dots in the progress bar, usually at the beginning, but sometimes a few times in the middle. This is why.

That said, I recommend that you don't do this. Justine's version of #150 is much better than mine. To use it, you can just clone the repo and switch to the mmap branch. If you load the model once, subsequent loads happen instantaneously, as long as the kernel hasn't had any reason to evict the model to free up space.

As discussed on #91, it's not a perfect solution, but it's pretty great. Actually properly fixing this problem will require making changes to the ggml file format, and possibly the library. I recommend using Justine's branch in the meantime.

@oKatanaaa
Author

oKatanaaa commented Mar 18, 2023

@apaz-cli Thanks for the input. I initially thought the same, because the fstream vs fopen performance comparisons I found were not really consistent (and very speculative) and depended on the particular case. Though intuition told me fopen should be faster for binaries (as I recall from my experience), so I decided to see if anything changed and implemented the fopen model load. To my surprise, the performance gains were substantial, which is why I decided to make a pull request.

Speaking of the mmap implementation, as far as I understand (though I don't understand a thing about mmap, really), it simply creates a large binary where the model weights are stored contiguously (I guess?) and uses it to make subsequent loads faster. The model loading code is not changed at all, only the fancy memory allocation stuff, meaning that my change and Justine's code are orthogonal and can benefit each other.

Getting back to the performance comparison of fstream vs fopen (and also mmap), I did some testing and here are the results:

| Implementation | First load | Subsequent load | Read speed (MB/s) on subsequent run |
|----------------|------------|-----------------|-------------------------------------|
| master         | 1m 49s     | 1m 50s          | 50                                  |
| mmap           | 8m 50s     | 1m 10s          | 20-30                               |
| fopen          | 13s        | 13s             | 200-300                             |
| mmap + fopen   | 6m 20s     | 1m 8s           | 20-30                               |

My machine:

  • CPU: Ryzen 7 4800H
  • RAM: 32GB, 3200MHz (not sure if it's dual channel)
  • SSD: ADATA SX6000PNP 1TB

Testing was done with the 13B model. I measured time (both for first and subsequent loads) until the model processed the first token, and I also monitored the disk's current read/write speed. I made sure to run sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null before each 'first load' and did several subsequent loads, picking an average result. Deviations from run to run weren't too large, so the numbers in the table should give a general picture.
I also implemented mmap + fopen by combining my code with the mmap implementation, and it did give a slight boost. The mmap implementation does a lot of writes to memory (the page file, I guess?) that take up a huge portion of the total loading time, so on the first run we still have to wait.

An interesting thing is that subsequent runs with mmap are still a lot slower than with fopen. I am not sure what's going on, as I expected an instantaneous load (but intuition tells me that the custom malloc and realloc read the required tensors from disk on demand as the model is evaluated during the first run; please correct me if I'm wrong).

As a summary:

  1. fopen is indeed faster than fstream in this case, at least on my machine. It would be really good if other people could confirm the performance gains on their machines.
  2. fopen (the current PR code) does not conflict with the mmap implementation in any way; both can be merged, bringing the benefits of each.
  3. The current implementation of mmap may not work as expected (subsequent runs are still slow), though there might be something up with my machine. Again, it'd be good to have other people verify these results.

@jart
Contributor

jart commented Mar 18, 2023

When you say mmap, are you talking about #150? Because that was the first iteration of our mmap() work. The second iteration we did is in the https://github.com/ggerganov/llama.cpp/tree/mmap branch. That code doesn't have a load phase at all in subsequent runs.

It's also important to specify (1) what operating system you're using, and (2) what type of weights you're using. Even with the code in the master branch, the 13B model, with either q4 or f16 weights, your computer should not be taking 1+ minutes on subsequent runs. On my $1000 workstation, subsequent runs with the old method take ~3 seconds to load 13B F16. With the new method, it loads in 0 seconds. The only way I could see a computer taking >1m given your hardware would be if it were running Windows, or if you had so many tabs open in your browser that there was practically no memory available for caches. For example, my workstation is headless and I use it via SSH, so that things like Chrome don't influence benchmarks.

@oKatanaaa
Author

@jart I was using the mmap branch. I believe I earlier tried #150 and it didn't work for me well.

(1) I am on a Windows 11 machine, but I build and run the code inside a Docker container (kitware/cmake:ci-debian10-x86_64-2023-03-08). I just didn't want the headaches of the Windows toolchain (as I've never worked with CMake on Windows), so I opted for the simplest route.
(2) Q4 weights.

Anyway, I tried building for Windows, and to my surprise the fopen and master implementations matched in loading speed (4 seconds, blazing fast). This completely agrees with @apaz-cli's argument. It seems the slow loading times are due to I/O problems inside the container, and I am not sure why there is such a dramatic difference.

Sadly, I couldn't build the mmap branch for Windows.

A little bit of googling turned up issues where people were struggling with slow I/O on mounted volumes (though on Mac):

It also seems like the same issue might be present on Linux machines as well: https://forums.docker.com/t/docker-extremely-slow-on-linux-and-windows/129752

It seems like everything works well when building 'natively'. So the benefits of this PR are not clear in native settings, but it might be an option in others (again, only if other people try out the code and see whether it works better for them).

@jart
Contributor

jart commented Mar 19, 2023

It's good that you're running Windows, because where we really need your help right now is in making sure our mmap branch builds with MSVC. Neither @apaz-cli nor I have access to MSVC at the moment. I wrote some polyfills and pushed them to the branch that I believe will work with WIN32. However, the change was written a priori and needs to be cleaned up and debugged by someone who has MSVC. Otherwise we won't be able to merge the improvements into the master branch.

We could still possibly find ways to improve the experience for people who are using virtual file systems like Docker. However, I would want to see a better theory that explains why the performance is bad and why it's actionable on our part, and we can't find those answers for you. For example, I'm not convinced that using C stdio rather than C++ stdio is itself in any way different. An example of something that would possibly convince me would be, "if we use this stride obtained from the statfs() block size then we see optimal I/O behavior", or "if we have a second thread prefaulting pages".

Another thing that would convince me is a perf record profiling report showing, "here my glibc version of libstdc++ has a bug where it spends all its time chewing up CPU rather than actually doing I/O, therefore we should just use C stdio." Until that happens, I can't merge a change like this.

@oKatanaaa
Author

@jart I'd be glad to help with debugging mmap for WIN32. What's needed right now is to clone the mmap branch, (try to) compile it with MSVC, and see if there are any bugs, is that right? And where do I report my findings? #91?

I will also try to do some investigation with regards to loading performance inside docker.

@apaz-cli
Contributor

@oKatanaaa The branch is already in the repo. Just git pull origin and git checkout mmap.

@jart
Contributor

jart commented Mar 19, 2023

You can report in #91. You can also send a pull request that targets the mmap branch.

Contributor

@anzz1 anzz1 left a comment


This is great!

However, the merge conflicts need to be resolved.

@CoderRC

CoderRC commented Mar 28, 2023

I have created my own library that implements mmap using mingw32, which makes this project maintainable on Windows. It is possible to compile the program using the library from https://github.com/CoderRC/libmingw32_extended, making changes like in #564, with the specific make command below:
make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended'

@anzz1
Contributor

anzz1 commented Mar 29, 2023

There is no need to use a separate library to replace C++ fstreams with C file streams, as C file streams work perfectly on Windows too. The comments are now conflating two things: mmap(), which is non-portable and requires an external library or polyfill, and this PR, which is about using fopen() and fread(), both of which are portable and do not require any additions.

@anzz1 anzz1 mentioned this pull request Mar 29, 2023
@jart
Contributor

jart commented Mar 29, 2023

@oKatanaaa do you still want to move forward with this PR? I'm still not convinced that C++ STL vs. C stdio will make a measurable difference, unless our goal is to convert the C++ code to C. Our mmap() work, on the other hand, makes weights load 10x faster for me on Windows now, and 100x faster on Linux.

@jart
Contributor

jart commented Mar 29, 2023

@anzz1 I looked at @CoderRC's project and it looks pretty good; it's unfair to say mmap() isn't portable, because it's standardized by IEEE. You'll also be happy to hear we now have our own working WIN32 mmap() polyfill in our mmap branch: https://github.com/ggerganov/llama.cpp/blob/mmap/mmap.c. @oKatanaaa and I spent the past few weeks building that. As I mentioned in my earlier comment, it's going to make loading larger models with multi-dimensional tensors (e.g. 13B) about 10x to 100x faster in terms of wall time. It doesn't require any libraries or dependencies. You're encouraged to take a look at the code and share your feedback!

@oKatanaaa
Author

> @oKatanaaa do you still want to move forward on this PR? I'm still not convinced that c++ stl vs. c stdio is going to make a measurable difference, unless our goal was to convert the c++ code to c. Our mmap() work on the other hand, makes weights load 10x faster for me on Windows now. And 100x faster on Linux.

I still want to tinker with the code a bit and find the reasons for the slow loading inside Docker on Windows, but I am focusing on other things at the moment, so we can close the PR. I'll try to do some research on this problem once I have some time, and if I find anything I'll make a new PR with fixes.

@oKatanaaa oKatanaaa closed this Mar 30, 2023