
Making weights loading faster #216

Closed · wants to merge 3 commits

Conversation

oKatanaaa

Tried to address slow weight loading. 7B is okay, but 13B is really slow (several minutes), which makes it hard to experiment/prototype with larger models.

Replaced std::ifstream with C-style file reading using fopen. Got a considerable boost in loading performance: 3x to 10x faster on my machine (measurements were somewhat inconsistent, but it is definitely a lot faster than before).

I made sure the weights are correctly loaded: fixed the seed and gave the same prompts - the model gives the same output, so everything is good.

Also increased the buffer size from 1024*1024 to 128*1024*1024 (see line 102), which gave a slight boost as well. Though I am not sure whether it is optimal for edge devices like the Raspberry Pi (if that's of any concern).

@gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels on Mar 17, 2023

@Kangmo Kangmo left a comment


looks good to me

@sw
Contributor

sw commented Mar 18, 2023

Is this faster than using mmap (see #91 / #150)?

@oKatanaaa
Author

> Is this faster than using mmap (see #91 / #150)?

Just tested the #150 implementation; it was a lot slower than plain fopen. I believe #150 should've been faster on subsequent runs, but I didn't see any perceptible speed-up, maybe just a little.

@Piezoid
Contributor

Piezoid commented Mar 18, 2023

> Is this faster than using mmap (see #91 / #150)?

Implementing #91 correctly lets us avoid creating any redundant copies of the model. This is crucial because if we duplicate it when there's only enough memory to hold one instance, we run the risk of evicting it from the operating system's file system cache.

@apaz-cli
Contributor

@oKatanaaa switching between std::ifstream and FILE* should make no measurable difference. They are both tunable, do conceptually the exact same thing, and support (almost) exactly the same set of operations. The main bottleneck is the time spent inside system calls reading from disk, and they make those calls in exactly or almost exactly the same way. The first implementations literally called fopen() directly; I forget if they still do.

Your testing methodology is probably at fault. Files that are opened once are likely to be opened again soon after, and OSes like to cache the contents of files (and shared memory pages too) so that subsequent reads are faster. On Linux you can flush the file and page caches with:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null

You may have noticed that with the vanilla build on master loading the model sometimes skips ahead by quite a few dots in the progress bar, usually at the beginning, but sometimes a few times in the middle. This is why.

That said, I recommend that you don't do this. Justine's version of #150 is much better than mine. To use it, you can just clone the repo and switch to the mmap branch. If you load the model once, subsequent loads happen instantaneously, as long as the kernel hasn't had any reason to evict the model to free up space.

As discussed on #91, it's not a perfect solution, but it's pretty great. Actually properly fixing this problem will require making changes to the ggml file format, and possibly the library. I recommend using Justine's branch in the meantime.

@oKatanaaa
Author

oKatanaaa commented Mar 18, 2023

@apaz-cli Thanks for the input. I initially thought the same, because the fstream vs fopen performance comparisons I found were not really consistent (and very speculative) and depended on the particular case. Though intuition told me fopen should be faster for binaries (as I recall from my experience), so I decided to see if anything changed and implemented the fopen model load. To my surprise, the performance gains were substantial, which is why I decided to make a pull request.

Speaking of the mmap implementation, as far as I understand (though I don't understand a thing about mmap, really), it simply creates a large binary where the model weights are stored contiguously (I guess?) and uses it to make subsequent loads faster. The model loading code is not changed at all, only the fancy memory allocation stuff, meaning that my change and Justine's code are orthogonal and can benefit each other.

Getting back to the performance comparison of fstream vs fopen (and also mmap), I did some testing and here are the results:

| Implementation | First load | Subsequent load | Read speed (MB/s) on subsequent run |
|----------------|------------|-----------------|-------------------------------------|
| master         | 1m 49s     | 1m 50s          | 50                                  |
| mmap           | 8m 50s     | 1m 10s          | 20-30                               |
| fopen          | 13s        | 13s             | 200-300                             |
| mmap + fopen   | 6m 20s     | 1m 8s           | 20-30                               |

My machine:

  • CPU: Ryzen 7 4800H
  • RAM: 32GB, 3200MHz (not sure if it's dual channel)
  • SSD: ADATA SX6000PNP 1TB

Testing was done with the 13B model. I measured time (both for first and subsequent loads) until the model processed the first token, and I also monitored the disk's current read/write speed. I made sure to run sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null before each 'first load' and did several subsequent loads, picking an average result. Deviations from run to run weren't too large, so the numbers in the table should give a general picture.
I also implemented mmap + fopen by combining my code with the mmap implementation, and it did give a slight boost. The mmap implementation does a lot of writes to memory (the page file, I guess?) that take up a huge portion of the total loading time, so on the first run we still have to wait.

An interesting thing is that subsequent runs with mmap are still a lot slower than with fopen. I am not sure what's going on, as I expected an instantaneous load (but intuition tells me that the custom malloc and realloc read the required tensors from disk on demand as the model is evaluated during the first run; please correct me if I'm wrong).

As a summary:

  1. fopen is indeed faster than fstream in this case, at least on my machine. It would be really good if other people could confirm the performance gains on their machines.
  2. fopen (the current PR code) does not conflict with the mmap implementation in any way; both can be merged, bringing the benefits of each.
  3. The current implementation of mmap may not work as expected (subsequent runs are still slow), though there might be something up with my machine. Again, it'd be good to have other people verify these results.

@jart
Contributor

jart commented Mar 18, 2023

When you say mmap, are you talking about #150? Because that was the first iteration of our mmap() work. The second iteration we did is in the https://github.com/ggerganov/llama.cpp/tree/mmap branch. That code doesn't have a load phase at all in subsequent runs.

It's also important to specify (1) what operating system you're using, and (2) what type of weights you're using. Even with the code in the master branch, the 13B model, with either q4 or f16 weights, your computer should not be taking 1+ minutes on subsequent runs. On my $1000 workstation, subsequent runs with the old method take ~3 seconds to load 13B F16. With the new method, it loads in 0 seconds. The only way I could see a computer taking >1m given your hardware would be if it were running Windows, or if you had so many tabs open in your browser that there was practically no memory available for caches. For example, my workstation is headless and I use it via SSH, so that things like Chrome don't influence benchmarks.

@oKatanaaa
Author

@jart I was using the mmap branch. I believe I earlier tried #150 and it didn't work for me well.

(1) I am on a Windows 11 machine, but I build and run the code inside a Docker container (kitware/cmake:ci-debian10-x86_64-2023-03-08). I just didn't want the headaches of the Windows toolchain (as I've never worked with CMake on Windows), so I opted for the simplest route.
(2) Q4 weights.

Anyway, I tried building for Windows, and to my surprise the fopen and master implementations matched in loading speed (4 seconds, blazing fast). This completely agrees with @apaz-cli's argument. It seems the slow loading times are due to I/O problems inside the container, and I am not sure why there is such a dramatic difference.

Sadly, I couldn't build the mmap branch for Windows.

A little bit of googling turned up issues where people were struggling with slow I/O on mounted volumes (though on Mac):

It also seems like the same issue might be present on Linux machines as well: https://forums.docker.com/t/docker-extremely-slow-on-linux-and-windows/129752

It seems like everything works well when building 'natively'. So the benefits of this PR are not clear in native settings, but it might be an option in others (again, only if other people try out the code and see whether it works better for them).

@jart
Contributor

jart commented Mar 19, 2023

It's good that you're running Windows, because where we really need your help right now is in making sure our mmap branch builds with MSVC. Neither @apaz-cli nor I have access to MSVC at the moment. I wrote some polyfills and pushed them to the branch that I believe will work with WIN32. However, the change was written a priori and needs to be cleaned up and debugged by someone who has MSVC. Otherwise we won't be able to merge the improvements into the master branch.

We could still possibly find ways to improve the experience for people who are using virtual file systems like Docker. However, I would want to see a better theory that explains why the performance is bad and why it's actionable on our part, and we can't find those answers for you. For example, I'm not convinced that using C stdio rather than C++ stdio is itself in any way different. An example of something that would possibly convince me would be, "if we use this stride obtained from the statfs() block size then we see optimal I/O behavior", or "if we have a second thread prefaulting pages".

Another thing that would convince me is a perf record profiling report showing, "here my glibc version of libstdc++ has a bug where it spends all its time chewing up CPU rather than actually doing I/O, therefore we should just use C stdio." Until that happens, I can't merge a change like this.

@oKatanaaa
Author

@jart I'd be glad to help with debugging mmap for WIN32. What's needed right now is to clone the mmap branch, (try to) compile it with MSVC, and see if there are any bugs, is that right? And where do I report my findings? #91?

I will also try to do some investigation with regards to loading performance inside docker.

@apaz-cli
Contributor

@oKatanaaa The branch is already in the repo. Just git pull origin and git checkout mmap.

@jart
Contributor

jart commented Mar 19, 2023

You can report in #91. You can also send a pull request that targets the mmap branch.

Contributor

@anzz1 anzz1 left a comment


This is great!

However, the merge conflicts need to be resolved.

@CoderRC

CoderRC commented Mar 28, 2023

I have created my own library that implements mmap using mingw32, which makes this project maintainable on Windows. It is possible to compile the program using the library from https://github.com/CoderRC/libmingw32_extended, making changes like in #564, with the specific make command below:
make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended'

@anzz1
Contributor

anzz1 commented Mar 29, 2023

There is no need to use a separate library to replace C++ fstreams with C file streams, as C file streams work perfectly on Windows too. The comments are now conflating two things: mmap(), which is non-portable and requires an external library or polyfill, and this PR, which is about using fopen() and fread(), both of which are portable and do not require any additions.

@anzz1 anzz1 mentioned this pull request Mar 29, 2023
@jart
Contributor

jart commented Mar 29, 2023

@oKatanaaa do you still want to move forward with this PR? I'm still not convinced that C++ STL vs. C stdio will make a measurable difference, unless our goal is to convert the C++ code to C. Our mmap() work, on the other hand, makes weights load 10x faster for me on Windows now, and 100x faster on Linux.

@jart
Contributor

jart commented Mar 29, 2023

@anzz1 I looked at @CoderRC's project and it looks pretty good; it's unfair to say mmap() isn't portable, because it's standardized by IEEE. You'll also be happy to hear we now have our own working WIN32 mmap() polyfill in our mmap branch: https://github.com/ggerganov/llama.cpp/blob/mmap/mmap.c. @oKatanaaa and I spent the past few weeks building that. As I mentioned in my earlier comment, it's going to make loading larger models with multi-dimensional tensors (e.g. 13B) about 10x to 100x faster in terms of wall time. It doesn't require any libraries or dependencies. You're encouraged to take a look at the code and share your feedback!

@oKatanaaa
Author

> @oKatanaaa do you still want to move forward on this PR? I'm still not convinced that c++ stl vs. c stdio is going to make a measurable difference, unless our goal was to convert the c++ code to c. Our mmap() work on the other hand, makes weights load 10x faster for me on Windows now. And 100x faster on Linux.

I still want to tinker with the code a bit and find the reasons for the slow loading inside Docker on Windows, but I am focusing on other things at the moment, so we can close the PR. I'll try to do some research on this problem once I have some time, and if I find anything I'll make a new PR with fixes.

@oKatanaaa oKatanaaa closed this Mar 30, 2023