Making weights loading faster #216
Conversation
looks good to me
By implementing #91 correctly, we can avoid creating any redundant copies of the model. This is crucial: if we duplicate it and there is only enough memory to hold one instance, we risk evicting it from the operating system's file system cache.
@oKatanaaa switching between … Your testing methodology is probably at fault. Files that are opened once are likely to be opened again soon after, and OSes like to cache the contents of files (and shared memory pages too) so that subsequent reads are faster. On Linux you can flush these file and page caches with:
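The actual command was lost in the page scrape; the standard way to drop these caches on Linux (it requires root) is:

```shell
# Write dirty pages back to disk first, then drop the page cache,
# dentries, and inodes (the "3" selects all three). Requires root.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
```

After running this, the next model load reads from disk again instead of from the page cache, which makes cold-load benchmarks honest.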
You may have noticed that with the vanilla build on master, loading the model sometimes skips ahead by quite a few dots in the progress bar, usually at the beginning, but sometimes a few times in the middle. This is why. That said, I recommend that you don't do this. Justine's version of #150 is much better than mine. To use it, you can just clone the repo and switch to the mmap branch. As discussed in #91, it's not a perfect solution, but it's pretty great. Properly fixing this problem will require changes to the ggml file format, and possibly the library. I recommend using Justine's branch in the meantime.
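Concretely, using the repo URL and branch name given later in this thread, that would look like:

```shell
# Clone the repo and switch to Justine's mmap branch.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout mmap
```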
@apaz-cli Thanks for the input. I initially thought the same, because performance comparisons for … Speaking of the … Getting back to the performance comparison of …
My machine:
Testing was done using the 13B model. I measured the time (both for first and subsequent loads) until the model processed the first token. I also monitored my disk's current read/write speed. I made sure to run … An interesting thing is that subsequent runs on … As a summary: …
When you say mmap, are you talking about #150? Because that was the first iteration of our mmap() work. The second iteration we did is in the https://github.com/ggerganov/llama.cpp/tree/mmap branch. That code doesn't have a load phase at all on subsequent runs. It's also important to specify (1) what operating system you're using, and (2) what type of weights you're using. Even with the code in the master branch, with the 13B model and either q4 or f16 weights, your computer should not be taking 1+ minutes on subsequent runs. On my $1000 workstation, subsequent runs with the old method take ~3 seconds to load 13B F16. With the new method, it loads in 0 seconds. The only way I could see a computer taking >1m given your hardware would be if it were running Windows, or if you had so many browser tabs open that there was practically no memory available for caches. For example, my workstation is headless and I use it via SSH, so that things like Chrome don't influence benchmarks.
@jart I was using the … (1) I am on a Windows 11 machine, but building and running the code inside a Docker container (… Anyway, I tried building for Windows, and to my surprise … Sadly I couldn't build … A little bit of googling turned up these issues where people were struggling with slow I/O for mounted volumes (though on Mac):
It also seems like the same issue might be present on Linux machines as well: https://forums.docker.com/t/docker-extremely-slow-on-linux-and-windows/129752. Everything seems to work well when building things 'natively', so the benefits of this PR are not clear in …
It's good that you're running Windows, because where we really need your help right now is in making sure our mmap branch is able to build with MSVC. Neither @apaz-cli nor I have access to MSVC at the moment. I wrote some polyfills and pushed them to the branch, which I believe will work with WIN32. However, the change was written a priori and needs to be cleaned up and debugged by someone who has MSVC. Otherwise we won't be able to merge the improvements into the master branch. We could still possibly find ways to improve the experience for people who are using virtual file systems like Docker. However, I would want to see a better theory that explains why the performance is bad and why it's actionable on our part. We can't help you find those answers. For example, I'm not convinced that using C stdio rather than C++ stdio is itself in any way different. An example of something that would possibly convince me would be "if we use this stride obtained from the statfs() block size, then we see optimal I/O behavior" or "if we have a second thread prefaulting pages". Another thing that would convince me is if you showed me a …
@jart I'd be glad to be of any help with debugging … I will also try to do some investigation with regard to loading performance inside Docker.
@oKatanaaa The branch is already in the repo. Just …
You can report in #91. You can also send a pull request that targets the mmap branch.
This is great!
However, the merge conflicts need to be resolved.
I have created my own library that implements mmap using mingw32, which makes this project maintainable for Windows. It is possible to compile the program using the library from https://github.com/CoderRC/libmingw32_extended, make changes like in #564, and use the specific make command below: …
There is no need to use a separate library to replace C++ fstreams with C file streams; C file streams work perfectly on Windows too. You are conflating two things in the comments now, which are using …
@oKatanaaa do you still want to move forward on this PR? I'm still not convinced that C++ STL vs. C stdio is going to make a measurable difference, unless our goal was to convert the C++ code to C. Our mmap() work, on the other hand, makes weights load 10x faster for me on Windows now. And 100x faster on Linux.
@anzz1 I looked at @CoderRC's project, and it looks pretty good; it's unfair to say mmap() isn't portable, because it's standardized by the IEEE. You'll also be happy to hear we now have our own working WIN32 mmap() polyfill in our mmap branch.
I still want to tinker with the code a bit and find the reasons for slow loading inside Docker on Windows, but I am focusing on other things at the moment, so we can close the PR. I'll try to do some research on this problem once I have some time. If I find anything, I'll open a new PR with fixes.
Tried to address slow weights loading. 7B is okay, but 13B is really slow (several minutes), which makes it hard to experiment/prototype with larger models.
Replaced `std::ifstream` with C-style file reading using `fopen`. Got a considerable boost in loading performance: 3x to 10x faster on my machine (the measured results were somewhat inconsistent, but it is definitely a lot faster than before). I made sure the weights are correctly loaded: with a fixed seed and the same prompts, the model gives the same output, so everything is good.

Also increased the buffer size from `1024*1024` to `128*1024*1024` (see line 102), which gave a slight boost as well. Though I am not sure whether that is optimal for edge devices like the Raspberry Pi (if that's of any concern).