
[User] Memory usage is extremely low when running 65b 4-bit models. (Only uses 5GB) #864

Closed
stuxnet147 opened this issue Apr 9, 2023 · 22 comments

Comments

@stuxnet147

stuxnet147 commented Apr 9, 2023

Dear llama.cpp team,

I am experiencing two issues with llama.cpp when using it with the following hardware:

CPU: Xeon Silver 4216 x 2ea
RAM: 383GB
GPU: RTX 3090 x 4ea

The first issue is that although the model requires a total of 41478.18 MB of memory, my machine only uses 5 GB of memory when running the model. I would like to know if this is normal behavior or if there is something wrong with my setup.

The second issue is related to the token generation speed of the model. Despite my powerful CPU setup, which consists of two Xeon Silver 4216 processors, I am only getting a token generation speed of 0.65 tokens/s. This seems slower than what I would expect from my hardware. Could you please advise on how to improve the token generation speed?

Here is the information you may need to help troubleshoot the issue:

[Software Env]

Python 3.9.16
Windows 10 21H2
oobabooga/text-generation-webui

[Output]


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary C:\Users\Lucy\.conda\envs\alpaca-serve\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
The following models are available:

1. alpaca-native
2. llama-30b-hf
3. llama-65b-hf
4. llama_cpp_65b
5. opt-1.3b
6. Salesforce_codegen-16B-multi
7. TianXxx_llama-65b-int4

Which one do you want to load? 1-7

4

Loading llama_cpp_65b...
llama.cpp weights detected: models\llama_cpp_65b\ggml-model-q4_0.bin

llama_model_load: loading model from 'models\llama_cpp_65b\ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: type    = 4
llama_model_load: ggml map size = 38917.99 MB
llama_model_load: ggml ctx size = 201.25 KB
llama_model_load: mem required  = 41478.18 MB (+ 10240.00 MB per state)
llama_model_load: loading tensors from 'models\llama_cpp_65b\ggml-model-q4_0.bin'
llama_model_load: model size = 38917.53 MB / num tensors = 723
llama_init_from_file: kv self size  = 2560.00 MB
C:\Users\Lucy\AppData\Roaming\Python\Python39\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://f973be860f84965921.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Output generated in 71.19 seconds (0.65 tokens/s, 46 tokens, context 30)
@akumaburn

akumaburn commented Apr 9, 2023

@stuxnet147 There was a recent change made that allowed the usage of mmap in order to avoid loading the entire model into memory; this helped users without enough memory to still be able to run larger models. However, it negatively affects users that do have enough memory to run the full model.

There is an option you can pass to the program to disable this:
--mlock

Though even with this, the 65B model may still be slow, because llama.cpp doesn't make use of the GPUs you've got. I'd suggest looking into a program that lets you run models on your GPU, such as https://github.com/nomic-ai/gpt4all
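For reference, here is a rough sketch of the kind of thing that flag asks the OS to do (map the model read-only and pin the mapping with mlock). This is illustrative only, not llama.cpp's actual loader, and it assumes Linux:

// Rough illustration (not llama.cpp's actual loader) of mapping a model file
// read-only and pinning the mapping in RAM so it cannot be paged out. Linux-only sketch.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <model-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }
    size_t size = (size_t) st.st_size;

    // A shared read-only mapping is backed by the kernel page cache,
    // so several processes can reuse the same physical pages.
    void *addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Force every page to be resident and keep it from being paged out.
    // This typically fails if RLIMIT_MEMLOCK (ulimit -l) is too low.
    if (mlock(addr, size) != 0) {
        perror("mlock");
    }

    // ... run inference against the mapped weights here ...

    munlock(addr, size);
    munmap(addr, size);
    close(fd);
    return 0;
}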

@stuxnet147
Author

@stuxnet147 There was a recent change made that allowed the usage of mmap in order to avoid loading the entire model into memory; this helped users without enough memory to still be able to run larger models. However, it negatively affects users that do have enough memory to run the full model.

There is an option you can pass to the program to disable this: --mlock

Though even with this, the 65B model may still be slow, because llama.cpp doesn't make use of the GPUs you've got. I'd suggest looking into a program that lets you run models on your GPU, such as https://github.com/nomic-ai/gpt4all

Can I use the --mlock option on Windows?

@akumaburn

@stuxnet147 I'm not sure, since I'm on Linux, but it should still work; try passing that option to the binary and see if it complains.

@KASR
Contributor

KASR commented Apr 9, 2023

Once #801 has been merged it should be possible to provide the --no-mmap option, which loads the full model as it did before the mmap implementation.

@cmp-nct
Contributor

cmp-nct commented Apr 9, 2023

@stuxnet147 There was a recent change .. However, it negatively affects users that do have enough memory to run the full model.

There is no negative impact. The current implementation of llama.cpp, whether using mmap or not, is not suitable for users without enough memory.
mmap() was implemented to use the OS memory management features, which allow file-level caching of the model.
There is no negative impact as long as you do not want to modify the read-only model.

To get to the question:
Your reported memory consumption is both correct and misleading.
The model memory is not attributed to the binary because it is not memory allocated by the process; it is allocated by the OS for the process.
So if it shows 6 GB, that means your process consumes that amount of allocated memory.
If you spawn a second simultaneous process using the same model, it will also consume its own share, but the model will be shared between both.
It also means no loading time, because the model is already cached.

That's the big benefit of using mmap() over malloc().

Overall the memory consumption is the same as before.
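To illustrate the difference, here is a rough sketch of the two loading strategies (my own illustration, not the actual llama.cpp code):

// Sketch of the two strategies being compared (not llama.cpp's code).
// Strategy A copies the weights into private heap memory of each process;
// strategy B maps the file so all processes share the kernel's page cache,
// which is why a second run against the same model starts without a load phase.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdlib>

// A: malloc + read: memory is charged to the process and duplicated per process.
void *load_with_read(int fd, size_t size) {
    char *buf = (char *) malloc(size);
    if (!buf) return NULL;
    size_t done = 0;
    while (done < size) {                       // read() may return short counts
        ssize_t n = pread(fd, buf + done, size - done, (off_t) done);
        if (n <= 0) { free(buf); return NULL; }
        done += (size_t) n;
    }
    return buf;
}

// B: read-only mapping: pages live in the shared page cache; task managers
// show low "process" memory even though the data is resident in RAM.
void *load_with_mmap(int fd, size_t size) {
    void *addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    return addr == MAP_FAILED ? NULL : addr;
}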

@akumaburn

akumaburn commented Apr 9, 2023

@cmp-nct

There is no negative impact. The current implementation of llama.cpp, whether using mmap or not, is not suitable for users without enough memory. mmap() was implemented to use the OS memory management features, which allow file-level caching of the model. There is no negative impact as long as you do not want to modify the read-only model.

To get to the question: Your reported memory consumption is both correct and misleading. The model memory is not attributed to the binary because it is not memory allocated by the process; it is allocated by the OS for the process. So if it shows 6 GB, that means your process consumes that amount of allocated memory. If you spawn a second simultaneous process using the same model, it will also consume its own share, but the model will be shared between both. It also means no loading time, because the model is already cached.

That's the big benefit of using mmap() over malloc().

Overall the memory consumption is the same as before.

I believe you may be thinking of the scenario where one does random reads from disk, but in the case of llama.cpp (pre #613) the entire model was loaded into memory upon initialization. At that point it is faster than mmap, because mmap still has disk latency/performance to deal with; they are not the same. mmap does come with a runtime performance penalty (after the initialization step, which is faster with mmap) precisely because it doesn't load everything into RAM.

I'm not sure what you mean by OS memory vs. process memory. You can in fact load things that are larger than your available system memory by using mmap, because it doesn't load everything in at once, only what is actually being read (which in the case of llama.cpp depends on the prompt). You wouldn't be able to load it all at once, though, if that's what you meant when you said the memory consumption is the same as before, because in actuality it wouldn't be (unless you were very unlucky or kept the conversation going for a long time), since most prompts don't use more than some fraction of the model's weights.

@cmp-nct
Contributor

cmp-nct commented Apr 10, 2023

I might have overlooked something, but the only thing that is different is that mmap does not preload the data. That's a feature, not a problem, and it's a tiny change to preload it.
So yes, currently your model will be loaded from disk when it is accessed, and from there on it's pure memory access.

You could call posix_madvise(addr, length, POSIX_MADV_SEQUENTIAL) followed by posix_madvise(addr, length, POSIX_MADV_WILLNEED) after the mmap() call to hint the OS to read ahead through the whole block.
Or you could replace the tensor seek with a read, like
current: fin.seekg(offset + tensor_data_size);
new: fin.seekg(offset); fin.ignore(tensor_data_size);
It's untested though.
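A sketch of what that could look like (my own, also untested; note that the advice values are separate hints, not OR-able flags, so each one gets its own call):

// Untested sketch of the madvise-style hint described above.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

static void hint_preload(void *addr, size_t length) {
    // Tell the kernel we will walk the mapping front to back...
    if (posix_madvise(addr, length, POSIX_MADV_SEQUENTIAL) != 0) {
        fprintf(stderr, "posix_madvise(SEQUENTIAL) failed\n");
    }
    // ...and that we would like the whole range faulted in ahead of time.
    if (posix_madvise(addr, length, POSIX_MADV_WILLNEED) != 0) {
        fprintf(stderr, "posix_madvise(WILLNEED) failed\n");
    }
}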

So no: mmap does not have to deal with more disk latency than no-mmap; it deals with exactly the same amount of latency, just with a different distribution of when that latency is experienced. Meaning: it happens once at the first inference instead of at load time.

Regarding OS memory management: you can not "load larger than RAM" into process memory. You can reserve process memory for more than you can actually load; that's quite a difference.
But if you try that you'll have swapping/pagefile horror in the background, and the OS does not handle that in a very optimized way.
It would swap the model out permanently while your inference moves through the tensors over and over again.
That could be optimized, but I don't think it's a main goal of the project.

@cmp-nct
Contributor

cmp-nct commented Apr 10, 2023

I just gave the preloading a test; it's a bit more than one line, but it works with mmap:
I tested it on Windows; it includes Linux code that should work.
On Linux you'll need unistd.h.

// Touch every page of a mapped file once so the OS faults it in up front.
#if defined(_WIN32)
#include <windows.h>   // GetSystemInfo
#else
#include <unistd.h>    // sysconf
#endif
#include <cstddef>     // size_t
#include <cstdio>      // perror

void preload_mmap_file(void *addr, size_t length)
{
// Get the page size of the system
#if defined(_WIN32)
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    long page_size = si.dwPageSize;
#else
    long page_size = sysconf(_SC_PAGE_SIZE);
    if (page_size == -1)
    {
        perror("sysconf");
        return;
    }
#endif

    // Loop over the mapped file, jumping by page size
    for (size_t i = 0; i < length; i += (size_t) page_size)
    {
        // Dereference the pointer at each page boundary
        volatile char c = ((char *)addr)[i];
        // Use the value of 'c' to avoid compiler warnings and ensure the loop is not optimized away
        (void)c;
    }
}
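A hypothetical call site could look like this (POSIX side only; the file path is just an example):

// Hypothetical usage of preload_mmap_file() above; illustrative only.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("models/llama_cpp_65b/ggml-model-q4_0.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) return 1;

    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) return 1;

    preload_mmap_file(addr, (size_t) st.st_size);  // touch every page up front

    // ... run inference ...

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}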

@akumaburn

akumaburn commented Apr 10, 2023

So no: mmap does not have to deal with more disk latency than no-mmap; it deals with exactly the same amount of latency, just with a different distribution of when that latency is experienced. Meaning: it happens once at the first inference instead of at load time.

I understand what you're getting at, but the fact is that the current reason mmap is being used is for sparse loading of the model, not for the concurrent-loading scenario you described above. Each inference can read different weights in the model, so it's not just the first inference that gets hit with a disk performance/latency penalty; each subsequent inference will load in different parts of the file and result in further disk access. Until the entire file is loaded into memory by mmap, it will be slower than loading straight from RAM (at least for inferences).

@cmp-nct
Contributor

cmp-nct commented Apr 10, 2023

So no: mmap does not have to deal with more disk latency than no-mmap; it deals with exactly the same amount of latency, just with a different distribution of when that latency is experienced. Meaning: it happens once at the first inference instead of at load time.

I understand what you're getting at, but the fact is that the current reason mmap is being used is for sparse loading of the model, not for the concurrent-loading scenario you described above. Each inference can read different weights in the model, so it's not just the first inference that gets hit with a disk performance/latency penalty; each subsequent inference will load in different parts of the file and result in further disk access. Until the entire file is loaded into memory by mmap, it will be slower than loading straight from RAM (at least for inferences).

I recall a reddit discussion where jart thought the model is sparse, but that was a fallacy stemming from misunderstanding the low process memory consumption. Or are you referring to something else?
Just like here, the memory reported is correct; it just does not include the mapped model.

mmap() comes with a lot of improvements for llama.cpp but "sparse inference" is not one of those.
The above code is functional but comes with a loading performance hit; I've updated a pull request if you want to try a faster one. Though the new method is not yet tested on Linux.

@akumaburn

akumaburn commented Apr 11, 2023

mmap() comes with a lot of improvements for llama.cpp but "sparse inference" is not one of those.

I'm not sure what you mean; in the several odd hours I've spent testing it with 30B/65B, that is exactly what it does. Are you speaking from theory or do you have evidence?

To be clear I'm on Linux; I'm not sure if that makes a difference here.

@cmp-nct
Contributor

cmp-nct commented Apr 11, 2023

mmap() comes with a lot of improvements for llama.cpp but "sparse inference" is not one of those.

I'm not sure what you mean; in the several odd hours I've spent testing it with 30B/65B, that is exactly what it does. Are you speaking from theory or do you have evidence?

To receive one single output token you need to load every single byte of the weights from disk.
If you think that's not true, I'd like to be pointed to the location where any of them can be skipped.
Afaik the whole "llama is sparse" idea came solely from that reddit or twitter thread where jart was talking about mmap(); there is no actual factual background to it (that I am aware of).

Linux/Windows makes no difference; it's the same model architecture.

@akumaburn

akumaburn commented Apr 11, 2023

@cmp-nct Consider that if that were true, mmap would copy the entire model into memory upon the first inference, which doesn't currently happen. You can verify this very easily by looking at the total system memory usage (including swap). It is significantly lower with mmap vs. without, even after the first inference.

See: #638 (comment)

@cmp-nct
Contributor

cmp-nct commented Apr 11, 2023

@cmp-nct Consider that if that were true, mmap would copy the entire model into memory upon the first inference, which doesn't currently happen. You can verify this very easily by looking at the total system memory usage (including swap). It is significantly lower with mmap vs. without, even after the first inference.

See: #638 (comment)

No, I can not confirm that. The system memory consumption reaches 100% of the model before you see the first token appear.
Your quoted comment just confirms what I said: jart claimed the model to be sparse, but that's just a misunderstanding of the memory readings and of how the model works. I hadn't seen that comment before, but I saw similar claims on twitter when I followed the drama unfold.

P.S. I'd be happy to be proven wrong; I'm always glad to learn about a mistake. But that needs to point to an actual part of the code. The inference code is just a page, if you ignore the ggml background.

P.P.S. If you really see disk loading after the first inference, this would indicate a deeper flaw and a serious performance impact.
That's something mmap() comes with: you lose control over the memory region. The OS can decide to throw a part away right before you access it.

@akumaburn

akumaburn commented Apr 11, 2023

@cmp-nct
It would take some time to know the actual reason, but if you look at the number of page faults reported by the utility @jart wrote, it seems that the model isn't using all the weights.

Now I may be mistaken about it loading additional weights with each additional inference; it could be that there are always some weights that aren't being used but by default are being copied into memory anyway as part of the model loading process.

This seems to be the behavior on my system (running Arch). For example, using the 65B model after the first inference, total system memory:

Without mmap:
[screenshot: total system memory without mmap]

With mmap:
[screenshot: total system memory with mmap]

@j-f1
Collaborator

j-f1 commented Apr 11, 2023

One possibility for why the fault rates are lower could be that the OS loads in multiple pages when a page fault is triggered, so fewer faults occur but the same amount of data is still loaded in. (That doesn't explain your RAM usage chart, though.)

@cmp-nct
Contributor

cmp-nct commented Apr 11, 2023

@cmp-nct It would take some time to know the actual reason, but if you look at the number of page faults reported by the utility @jart wrote, it seems that the model isn't using all the weights.

Now I may be mistaken about it loading additional weights with each additional inference; it could be that there are always some weights that aren't being used but by default are being copied into memory anyway as part of the model loading process.

This seems to be the behavior on my system (running Arch). For example, using the 65B model after the first inference, total system memory:

Without mmap: [screenshot]

With mmap: [screenshot]

Yeah, it's just a memory monitoring glitch or bug.
If the tool is not bugged and I had to guess (don't hold me to that), I'd say the OS is marking a part of the memory as "discarded" despite it having been used moments ago.
So during inference, what happens is that the OS "finds" the discarded part mostly unharmed and marks it as "in use" again without accessing the disk. It's a concerning problem with mmap.

I should add: I have seen such behavior during development of my linked commit.
In that case, on Windows, PrefetchVirtualMemory is used to preload (most of) the weights; the memory shoots up during that, and within half a second the OS decides "oops" and the memory is back down to where it was before.
However, when you access it anyway, the memory is found at very high speed with little disk access. So the memory was there; it was just marked as "free" but still ready to be associated again.
Combined with my preloading, that behavior stops (on Windows), but it took me almost 3 hours to get there.
mmap() comes with such issues, but it's the only method we have to share memory between process runs.
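For reference, the Windows prefetch call I'm referring to looks roughly like this (an illustrative sketch; PrefetchVirtualMemory needs Windows 8 / Server 2012 or newer, so you may need to target that version before including windows.h):

#if defined(_WIN32)
#include <windows.h>

// Illustrative sketch: ask the memory manager to bring the mapped pages in
// efficiently; the OS is still free to trim them again under memory pressure.
static bool prefetch_mapping(void *addr, size_t length) {
    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = addr;
    range.NumberOfBytes  = length;
    return PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0) != FALSE;
}
#endif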

@akumaburn

@cmp-nct I'm not sure that explanation holds, because even during token generation, while the inference is still in progress, the memory usage is like the above, unless you're implying it's being allocated/discarded faster than the tool can detect.

@cmp-nct
Contributor

cmp-nct commented Apr 11, 2023

@cmp-nct I'm not sure that explanation holds, because even during token generation, while the inference is still in progress, the memory usage is like the above, unless you're implying it's being allocated/discarded faster than the tool can detect.

Yes, I'd say the tool is either faulty or too slow to report such things. Or both.
Also, it's Linux; maybe there is a general problem with this kind of memory reporting, that's not impossible. So the kernel might have an issue.
It sounds unlikely, but the vast majority of memory concerns are about allocated process memory; that's where most of the focus is.
As we see on Windows, the reporting behavior is not reliable.

@akumaburn

akumaburn commented Apr 11, 2023

@cmp-nct Hmm, wouldn't mmap be significantly slower on inference, then?
Given that it would have to do a memcpy (allocation / copy / deallocation) for each block being read, every time it is read?

I have noticed some slowdown, but it isn't of the order of magnitude we'd expect when going from 30 GB/s RAM to a 3 GB/s SSD.

All of this is just theory, though; it would be interesting to see the actual numbers on how many allocates/deallocates are being done per nanosecond with mmap.

@cmp-nct
Contributor

cmp-nct commented Apr 11, 2023

@cmp-nct Hmm, wouldn't mmap be significantly slower on inference, then? Given that it would have to do a memcpy (allocation / copy / deallocation) for each block being read, every time it is read?

No memory copy is involved. All the OS does is set flags; that's my guess.
It just sets a bit or byte saying that "pages 1-2 million" are free, so your tool will report free space.
Now if you happen to access that region, the memory management will see that the "free" page is still untouched and set the flag back to "ready".
At kernel level it takes a tiny fraction of a second to mark all pages one way or another. Another possibility is simply wrong reporting.
On Windows there is no API to monitor that (and it would be futile to do it); on Linux I recall there are ways to check whether a virtually addressed page is resident or not, but that's probably slower than just reading it and triggering the allocation backend that way.

In any case, you sadly do not have a 60% sparse model.
If you are still doubting it, just run a test: make it generate 1 token from a prompt with 1 token and use something like "iostat" to see how much your SSD was accessed. You'll see that the entire model was read.

@akumaburn

akumaburn commented Apr 11, 2023

@cmp-nct What about the cache of mmap? To my understanding it caches whatever you read in memory, so that the next read doesn't need to re-fetch from disk.

"An application can determine which pages of a mapping are
currently resident in the buffer/page cache using mincore(2)."

https://man7.org/linux/man-pages/man2/mmap.2.html

Though it may be specific to Linux? (https://biriukov.dev/docs/page-cache/2-essential-page-cache-theory/)
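As a quick sketch of how one could measure that, here is a small Linux-only mincore() example (my own illustration, untested against llama.cpp itself) that counts how many pages of a mapped model file are resident in the page cache:

// Linux-only sketch: report how much of a mapped file is resident in the page cache.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <model-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror("open/fstat"); return 1; }

    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = ((size_t) st.st_size + page - 1) / page;
    std::vector<unsigned char> vec(npages);

    // Each byte in vec has bit 0 set if the corresponding page is resident.
    if (mincore(addr, st.st_size, vec.data()) != 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (unsigned char v : vec) resident += (v & 1);
    printf("%zu of %zu pages resident (%.1f%%)\n",
           resident, npages, 100.0 * resident / npages);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}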

