Memory leaks when uncompressing multi-volume archives #575

Open
aquirin opened this issue Mar 3, 2024 · 6 comments
Labels
duplicate, for extraction, help wanted, Speed/Performance

Comments


aquirin commented Mar 3, 2024

Describe the bug

It seems that decompressing a multi-volume archive with a relatively large number of files (321 "outer" volumes, 8000 compressed files of 400 KB each, so about 3 GB in total) leaks memory.

The basic code which is failing is:

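import multivolumefile
import py7zr
count_files = 0
with multivolumefile.open(zip_path, mode='rb') as multizip_handler:
    with py7zr.SevenZipFile(multizip_handler, 'r') as zip_handler:
        # read() returns a dict of all decompressed files; only count them here
        for fname, fcontent in zip_handler.read(targets=None).items():
            count_files += 1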

See complete function: uncompress.py.txt

The corresponding archive is a multi-volume archive of 8000 files of 400 KB each, filled with random data and split into 10 MB volumes. No filters, specific headers, encryption, or password were set. The compression options were left at their defaults.

A copy of the archive is available here: multi.zip. Please note that the first level needs to be uncompressed manually before the test. The actual archive to be tested is the folder with the 321 "7z" volumes.

Better, it is possible to reproduce this archive (modulo the random data) using the following code: compress.py.txt. Several tests indicate that the behavior is not related to the random content, only to the size of the files.
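As a rough illustration of the input side of that reproduction, the sketch below writes a folder of fixed-size random files using only the standard library. It is not the attached compress.py.txt: the function name make_random_files, the file naming pattern, and the default of 400,000 bytes per file are assumptions chosen to match the figures in this report, and the step that packs the folder into 10 MB volumes is left out.

```python
import os
from pathlib import Path


def make_random_files(out_dir, count=8000, size=400 * 1000):
    """Write `count` files of `size` random bytes each into `out_dir`.

    Hypothetical helper: names and defaults are assumptions, not taken
    from the attached compress.py.txt.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(count):
        # os.urandom produces incompressible data, like the archive in the report
        (out_dir / f"file_{i:05d}.bin").write_bytes(os.urandom(size))
    return out_dir
```

Calling make_random_files("input") with the defaults would write about 3.2 GB, so a smaller count is safer for a first try.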

If enough memory is available, the archive can be uncompressed without any issue. The process still takes a lot of memory (3.3 GB), which is not expected since each compressed file is quite small and the uncompression script discards the data immediately.

$ ps up <pid>
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
<user>     <pid> 17.3 39.6 3420292 3206732 pts/1 S+   00:33   0:55 python3 code.py

If not enough memory is available, the uncompression script crashes, either with a CRC error (see log below) or with Bad7zFile: invalid header data. The CRC error seems to be only a consequence of the lack of memory, as the archive itself looks perfectly fine.

7z-crc-error.log

We can see the archive is error-free:

7z t test.7z.0001

[...]
Everything is Ok

Files: 8000
Size:       3201920000
Compressed: 3202141273

Note that for the purpose of tests, it is possible to deliberately fill the memory using commands such as: head -c 5G /dev/zero | tail

Related issue

These issues might be related to this one, but none of the existing tickets mention multivolume and OOM at the same time:

To Reproduce

  1. Download the archive from the Google Drive link and uncompress it to get a single folder containing the 321 7z files. Or better: use the compress.py.txt script to generate a random archive with the correct sizes.
  2. Run the following code with Python 3: uncompress.py.txt
  3. Run ps up <pid> in another terminal to watch the memory usage increase
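For step 3, peak memory can also be read from inside the process rather than from ps. The sketch below is an assumption about how one might instrument the test script, not part of uncompress.py.txt. It relies on the Unix-only resource module, and the helper name peak_rss_mib is hypothetical.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
# Stand-in for decompressed data that is kept referenced in memory
payload = b"\x01" * (50 * 1024 * 1024)
after = peak_rss_mib()
print(f"peak RSS went from {before:.0f} MiB to {after:.0f} MiB")
```

Printing peak_rss_mib() once after the extraction loop would give a figure comparable to the RSS column of ps, without needing a second terminal.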

Expected behavior

Even if the archive has a total size of 3 GB, uncompressing it file by file, where each file is 400 KB, should not fill the memory. Uncompressing a multi-volume archive should have a very low memory footprint, as it should be possible to write the bytes directly to disk, whatever the size of the archive, the size of the individual files, the number of volumes, or the number of compressed files in the archive.

Environment (please complete the following information):

  • OS:
  • Python
  • py7zr version:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"

$ python --version
Python 3.8.0

$ pip freeze | grep py7zr
py7zr==0.21.0

$ pip freeze | grep multivolumefile
multivolumefile==0.2.3

Test data (please attach in the report):

See provided archive or script to generate it above.

Additional context

aquirin changed the title from "Memory leaks when uncompressing multipart archives" to "Memory leaks when uncompressing multi-volume archives" on Mar 3, 2024
Author

aquirin commented Mar 3, 2024

After checking this a bit more, it seems to me that the issue might reside here:

self._dict[fname] = _buf

Filling the dict in the loop inside the _extract function might prevent a low memory footprint when there are many dict entries, even if we close the buffers or remove the dict entries later in the caller. Would it be possible to have a true iterator, using yield for instance?
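To make the difference concrete, here is a minimal standard-library sketch that compares the peak allocation of a dict-returning reader with that of a generator. It does not use py7zr at all: fake_decompress, read_into_dict, and iter_members are hypothetical stand-ins, and the file count is kept small so the demo runs quickly.

```python
import tracemalloc

FILE_SIZE = 400_000  # roughly the per-file size from the report
FILE_COUNT = 50      # kept small so the demo runs quickly


def fake_decompress(index):
    """Hypothetical stand-in for decompressing one archive member."""
    return bytes(FILE_SIZE)


def read_into_dict():
    """Mimic a dict-returning API: every buffer stays referenced."""
    return {f"file_{i}": fake_decompress(i) for i in range(FILE_COUNT)}


def iter_members():
    """Mimic a generator API: hand out one buffer at a time."""
    for i in range(FILE_COUNT):
        yield f"file_{i}", fake_decompress(i)


def count_from_dict():
    return sum(1 for _ in read_into_dict().items())


def count_from_iter():
    return sum(1 for _ in iter_members())


def peak_bytes(consume):
    """Return the peak traced allocation while running consume()."""
    tracemalloc.start()
    try:
        consume()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


dict_peak = peak_bytes(count_from_dict)
iter_peak = peak_bytes(count_from_iter)
print(f"dict peak: {dict_peak / 1e6:.1f} MB, generator peak: {iter_peak / 1e6:.1f} MB")
```

In the dict variant the peak grows with the number of files, while in the generator variant it stays near the size of one or two buffers, which is the behavior the question above is asking for.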

Note that my code runs fine with extractall or extract instead of read, since these functions pass return_dict=False, which does not fill any dict and thus saves memory.
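A minimal sketch of that workaround, assuming the same multivolumefile and py7zr calls as the snippet in the report and writing to disk with extractall instead of building a dict with read. The function name extract_multivolume and the file-counting step are illustrative additions, not part of the original script.

```python
from pathlib import Path


def extract_multivolume(base_path, out_dir):
    """Extract a multi-volume 7z archive to disk and count the files written.

    Hypothetical helper: base_path is whatever path multivolumefile.open
    expects for the set of volumes.
    """
    # Third-party packages, imported here so they are only needed when called
    import multivolumefile
    import py7zr

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with multivolumefile.open(base_path, mode="rb") as volumes:
        with py7zr.SevenZipFile(volumes, "r") as archive:
            # extractall writes members to disk instead of returning them in a dict
            archive.extractall(path=out_dir)
    return sum(1 for p in out_dir.rglob("*") if p.is_file())
```

This trades memory for disk space: the decompressed files have to be read back from out_dir if their contents are needed afterwards.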

miurahr added the help wanted, for extraction, and Speed/Performance labels on Mar 4, 2024
Owner

miurahr commented Apr 2, 2024

Duplicate of #579

Owner

miurahr commented Oct 14, 2024

#620 removes the API which returns the dictionary. It will solve the problem here.

Owner

miurahr commented Oct 14, 2024

You are welcome to implement an alternative API that uses yield and send me a PR.

Would it be possible to have a true iterator, using yield for instance?

Owner

miurahr commented Oct 14, 2024

The readall API returns a dictionary with all the contents in memory. It is not a leak: you ask py7zr to load all the content into memory and then ignore it. So it is not a bug.

Author

aquirin commented Oct 16, 2024

Actually, yield could be complex to implement, and the new Writer class makes it unnecessary anyway. However, I am still facing issues, so I will comment in #620 instead. You can mark this ticket as a duplicate: once #620 is solved, you can consider this one solved too.
