
simple convert scripts run out of tmpfs space on Linux #3433

Closed
cebtenzzre opened this issue Oct 2, 2023 · 18 comments

@cebtenzzre
Collaborator

cebtenzzre commented Oct 2, 2023

By default, Linux prevents tmpfs from using more than 50% of available system memory. This is normally a good thing, but the simple convert scripts write all tensor data to tmpfs before saving the output file, causing this exception if the converted model is larger than 50% of system RAM (ref):

Traceback (most recent call last):
  File "/home/cebtenzzre/src/forks/llama.cpp/convert-baichuan-hf-to-gguf.py", line 279, in <module>
    gguf_writer.add_tensor(new_name, data)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 622, in add_tensor
    tensor.tofile(self.temp_file)
OSError: Not enough free space to write 140247040 bytes

This is annoying. You can set TMPDIR=/var/tmp to work around this, but then you need twice as much free disk space as the size of the output file - which I don't always have.

The smallest change that would fix this problem would be to provide a way to effectively disable use_temp_file on Linux, while still supporting use of e.g. /var/tmp if desired. That way, I could leverage 100% of my RAM, and my swap space, in order to convert these models. If we choose this route, we should make sure the converted tensor data is freed as it is written, to avoid unnecessary swapping - right now it is not.

We can't just change the simple scripts to convert on-the-fly, because they load one pytorch file at a time, so making two passes over the input tensors would have a high I/O cost.

We could make LazyUnpickler part of the gguf module. That way, we can significantly reduce memory usage, and avoid the need to load the entire model into memory, while hopefully still keeping the non-LLaMA conversion scripts relatively simple.

@ggerganov what do you think?

@staviq
Collaborator

staviq commented Oct 2, 2023

If I'm reading the code right, that temp file is only ever written to sequentially (no seeks) until the point where it's rewound and copied to the output file.

If that's correct then, as far as Linux is concerned, there shouldn't really be any benefit to using a temp file for a write-only workload, since the kernel page cache already works in a similar way, and the final write to the output file still takes as long as it takes.

If, however, the temp file does in fact provide a performance benefit, something else must be suboptimal.

Python's open() allows specifying a buffer size, which isn't used in gguf.py; per the docs, that means default OS buffering is used, which is likely on the order of kilobytes.

I would imagine that specifying a buffer size in open() (something like, say, 100 MB) would make it possible to get rid of the temp file.
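
For illustration, a rough sketch of what that could look like (the 100 MB figure and the output path are just placeholders, not what gguf.py actually does):

import numpy as np

# Give the output file a large userspace write buffer instead of spooling
# tensor data to a temp file first; 100 MB here is an arbitrary example.
buf_size = 100 * 1024 * 1024
with open('model.gguf', 'wb', buffering=buf_size) as fout:
    data = np.zeros((4096, 4096), dtype=np.float32)  # stand-in tensor
    data.tofile(fout)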

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 2, 2023

Back when I did my cleanup stuff, I also added a feature to GGUFWriter where you can turn off using a temp file. Just pass use_temp_file = False to the constructor. I left it off by default just to avoid breaking stuff; I'm not 100% sure what the advantage of using the tempfile is.

Just for example, the GGML to GGUF converter script always sets that:

def save(self):
    print('* Preparing to save GGUF file')
    gguf_writer = gguf.GGUFWriter(
        self.cfg.output,
        gguf.MODEL_ARCH_NAMES[gguf.MODEL_ARCH.LLAMA],
        use_temp_file = False )

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 2, 2023

I'm not 100% sure what the advantage of using the tempfile is.

The only reason to use a tempfile is on platforms other than Linux (*BSD, macOS, Windows) where the temp folder is usually not a ramdisk. There it allows you to convert a model whose output size is larger than your RAM plus swap space. But on typical Linux installs, use_temp_file has no benefit unless you point TMPDIR at a directory on disk.
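
A minimal sketch of what that could look like in a simple convert script (the output path is a placeholder; GGUFWriter and the arch constants are the ones the existing scripts already use):

import sys

import gguf

# Only spool tensor data to a temp file on platforms where the default temp
# dir is normally disk-backed; on typical Linux installs /tmp is tmpfs.
use_temp = not sys.platform.startswith('linux')
gguf_writer = gguf.GGUFWriter(
    'output.gguf',
    gguf.MODEL_ARCH_NAMES[gguf.MODEL_ARCH.LLAMA],
    use_temp_file = use_temp)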

@ggerganov
Owner

We could make LazyUnpickler part of the gguf module. That way, we can significantly reduce memory usage, and avoid the need to load the entire model into memory, while hopefully still keeping the non-LLaMA conversion scripts relatively simple.

This sounds like the best option, but I'm not sure how difficult it would be to achieve, or whether we can implement a simple enough API to reuse it in the simple convert scripts.

@KerfuffleV2
Collaborator

There's an approach that should mostly avoid the temp file and memory issues, and that's to just numpy.memmap the files. PyTorch files are ZIP archives, but they never actually compress the tensor data; it's always just STOREd. numpy also supports mapping as copy-on-write, so you can do stuff like permute the tensor data without affecting the underlying file, which can still be opened read-only. Ref: https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
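
For illustration, a rough sketch of the copy-on-write mapping (the file name, offset, and shape are hypothetical; in practice they would come from the unpickled PyTorch metadata plus the zip member's data offset):

import numpy as np

# mode='c' maps the file copy-on-write: reads hit the page cache, writes go
# to private pages, and the file on disk is never modified.
tensor_offset = 64             # hypothetical byte offset of the STOREd data
tensor_shape = (4096, 4096)    # hypothetical tensor shape
data = np.memmap('consolidated.00.pth', dtype=np.float16, mode='c',
                 offset=tensor_offset, shape=tensor_shape)
data *= -1  # permute or scale freely; the underlying file stays read-only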

There is one limitation though: on 32-bit systems, only files up to 2 GB can be mmapped. Would that be acceptable?

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 7, 2023

There is one limitation though: on 32-bit systems, only files up to 2 GB can be mmapped. Would that be acceptable?

AFAIK python on 32-bit Linux is built with 64-bit file offsets (-D_FILE_OFFSET_BITS=64), so the limit is actually 4GB. That's also the practical limit for loading on CPU, so it's probably fine.

edit: we use mmap for safetensors files in convert.py anyway.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 7, 2023

the limit is actually 4GB.

The numpy reference for memmap specifically says it only supports 2 GB, so it's possible their wrapper adds some extra restrictions. (But it would also be possible to bypass that and just use Python's mmap instead if necessary.)

The source for the function is here: https://github.com/numpy/numpy/blob/main/numpy/core/memmap.py - it's pretty short if you want to take a look and see if there's a reason it would only support 2 GB on 32-bit. (It kind of looks like the main reason to use that function is that it simplifies using offset, whereas with raw mmap the offset usually has to be a multiple of the page size.)

@cebtenzzre
Collaborator Author

just numpy.memmap the files

Unfortunately, an approach like this would not work with convert.py because the original LLaMA tensors are split between multiple files, and must be concatenated. And I would rather not have different versions of LazyUnpickler between convert.py and the gguf package.

@KerfuffleV2
Collaborator

an approach like this would not work with convert.py because the original LLaMA tensors are split between multiple files, and must be concatenated.

Why not? Like I mentioned, it can be mmapped CoW, so there aren't any restrictions on manipulating the data. The only thing I was thinking is that it might be necessary to make that part more lazy (if it isn't already), because concatenating/permuting all the tensors in advance would be basically the same as loading them into memory.

And I would rather not have different versions of LazyUnpickler between convert.py and the gguf package.

Sure, there should be only one version of it in one place. I'm not sure how that relates to loading the data via mmap instead of just reading it from a file though.

@cebtenzzre
Collaborator Author

LazyUnpickler already waits until the last second to load tensor data from disk - it builds a chain of 'load' functions and then calls them when the data is needed. mmap would not be any simpler (it requires extra effort to find the offset in the zipfile, which is not directly exposed by the API) or save significant memory usage in convert.py as far as I can tell.
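
For readers unfamiliar with the pattern, a rough sketch of the idea (this is not the actual convert.py implementation, just an illustration):

from dataclasses import dataclass
from typing import Callable

import numpy as np

# Each operation wraps the previous loader instead of touching any data;
# nothing is read from disk until .load() is finally called.
@dataclass
class LazyTensor:
    load: Callable[[], np.ndarray]

    def astype(self, dtype) -> 'LazyTensor':
        return LazyTensor(lambda: self.load().astype(dtype))

lazy = LazyTensor(lambda: np.fromfile('tensor.bin', dtype=np.float16))
lazy = lazy.astype(np.float32)
# tensor = lazy.load()  # the actual disk read happens only here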

@KerfuffleV2
Collaborator

it requires extra effort to find the offset in the zipfile, which is not directly exposed by the API

That's true, but it's not too hard, since it does expose the offset of the local file header (you basically just have to read the file name length and extra field length, and the raw data will be right past that).
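
A hedged sketch of that calculation (assuming the member is STOREd, as PyTorch tensor data is; stored_data_offset is just an illustrative helper, not part of any existing API):

import struct
import zipfile

def stored_data_offset(path: str, member: str) -> int:
    # The raw data of a STOREd member starts right after the 30-byte local
    # file header plus the variable-length file name and extra fields.
    with zipfile.ZipFile(path) as zf:
        header_offset = zf.getinfo(member).header_offset
    with open(path, 'rb') as f:
        f.seek(header_offset)
        header = f.read(30)
    name_len, extra_len = struct.unpack('<HH', header[26:30])
    return header_offset + 30 + name_len + extra_len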

or save significant memory usage in convert.py as far as I can tell.

Well, you wouldn't run out of temp space because there'd no longer be a need to use that approach. Right?

Also, when concatenating tensors, the data from the first one in the sequence wouldn't be written to, so it would basically be the same as a read-only mmap. I'm not sure whether permuting touches every element or not. The gguf interface could probably be changed a little to support taking a list of tensor arrays that need to be saved; that way it potentially wouldn't be necessary to concatenate everything into one big numpy array at all.

@cebtenzzre
Collaborator Author

convert.py doesn't use tmpfs; it calls GGUFWriter.write_tensor_data directly. This issue is about the simple scripts that call GGUFWriter.add_tensor.

cebtenzzre mentioned this issue Oct 18, 2023
@Galunid
Collaborator

Galunid commented Nov 7, 2023

I wonder if a better temporary solution is to set dir to the model directory, rather than $TMPDIR. I think it's much more likely that a user's tmpfs won't be able to fit a whole model, whereas a "normal" partition is much more likely to be able to fit 2x the model size in the pessimistic case.

@KerfuffleV2
Collaborator

Python's temporary file handling should respect environment variables like TMPDIR, so I think all you'd need to do is add a little something to the docs to make it more obvious that large files will be created in the default temp directory, plus an explanation of how the user can change that if they want.

@Galunid
Collaborator

Galunid commented Nov 7, 2023

I'm not sure if you understood what I meant. Python's SpooledTemporaryFile exposes a dir argument in its constructor, which allows changing the default behavior. I feel that's warranted in this case, since (unquantized) models are usually 9 GB+, while in most cases the whole tmpfs is only ~8-16 GB (the default is 50% of RAM, excluding swap). I think it's better to override the default rather than let it fail. This issue has been pretty frustrating for me recently: when you convert larger models it usually fails near the end, after it has run for a good 30-40 minutes, and you have to start from scratch. Of course the best solution is to use LazyUnpickler, but that's a more complex approach.
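
For reference, a sketch of what that override might look like (model_dir is a placeholder for the input model's directory, and the spool threshold is just an example):

import tempfile

# Spool to the model's directory instead of the (tmpfs-backed) default.
model_dir = './models/7B'  # hypothetical path
temp_file = tempfile.SpooledTemporaryFile(max_size=256 * 1024 * 1024,
                                          dir=model_dir)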

@KerfuffleV2
Collaborator

I'm not sure if you understood what I meant.

What I meant is that there's already a convention for letting users set where temp files get stored, and SpooledTemporaryFile respects that as far as I know. So we don't need to override that behavior; we just need to make users aware that they may need to set it. See what I mean?

I actually think it would probably be weird to override that behavior and use some other temp directory.

@cebtenzzre
Collaborator Author

The point of SpooledTemporaryFile is actually that dir is supposed to be on disk - it's designed to use memory until a certain threshold, then write to dir, which is assumed to have more space, not less. Otherwise, you would just use RAM and not use a file object at all.

TMPDIR is the default for consistency with e.g. NamedTemporaryFile, but on Linux we really should not spool to /tmp - we should use the current directory (like transformers does) or /var/tmp.
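
A minimal sketch of that idea, assuming we keep SpooledTemporaryFile and only change where it spills on Linux:

import sys
import tempfile

# On Linux, spill to /var/tmp (normally disk-backed) rather than the
# tmpfs-backed default; other platforms keep their default temp dir.
spool_dir = '/var/tmp' if sys.platform.startswith('linux') else None
temp_file = tempfile.SpooledTemporaryFile(max_size=256 * 1024 * 1024,
                                          dir=spool_dir)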

Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 3, 2024