Memory leaks when uncompressing multi-volume archives #575
Comments
Checking this a bit more, and it seems to me that the issue might reside here: Line 588 in 6b253d1
Filling the dict in the loop inside the … Note that my code is running fine with …
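Schematically, the suspected pattern is something like the following sketch (hypothetical names, not the actual py7zr source): every decompressed payload stays referenced in the dict, so peak memory grows with the total decompressed size rather than with the largest single file.

```python
import io

def read_folder(entries):
    # Hypothetical illustration only, not py7zr's code: `entries` stands for the
    # decompressed streams of one folder. Everything read is kept in `result`,
    # so nothing is released until the caller drops the whole dict.
    result = {}
    for name, reader in entries:
        result[name] = io.BytesIO(reader.read())
    return result
```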
Duplicate of #579
#620 removes the API which returns the dictionary. It will solve the problem here.
You are welcome to implement an API which uses …
Describe the bug
It seems decompressing a multi-volume archive with a relatively large number of files (321 "outer" volumes, 8000 compressed files, 400 KB each, so about 3 GB in total) produces memory leaks.
The basic code which is failing is:
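A minimal sketch of this kind of per-file loop, assuming the volumes are named multi.7z.0001, multi.7z.0002, … and using py7zr's dict-returning read() together with the multivolumefile helper (the attached file contains the actual function):

```python
import multivolumefile
import py7zr

# Sketch only: the archive name and volume layout are assumptions; see
# uncompress.py.txt for the real function. multivolumefile presents the
# .0001, .0002, ... parts to py7zr as one seekable file object.
with multivolumefile.open('multi.7z', mode='rb') as volumes:
    with py7zr.SevenZipFile(volumes, mode='r') as archive:
        for name in archive.getnames():
            data = archive.read(targets=[name])  # returns {name: BytesIO}
            del data                             # discarded immediately...
            archive.reset()                      # ...yet the process keeps growing
```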
See complete function: uncompress.py.txt
The corresponding archive is a multi-volume archive of 8000 files, 400 KB per file, filled with random data, and split every 10 MB. No filters, specific headers, encryption, or password have been set. The compression options are the defaults.
A copy of the archive is available here: multi.zip. Please note that the first level needs to be uncompressed manually before the test. The actual archive to be tested is the folder with the 321 "7z" volumes.
Better still, it is possible to reproduce this archive (modulo the random data) using the following code: compress.py.txt. Several tests indicate that the behavior is not related to the random content, only to the size of the files.
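For reference, a sketch of such a generator, with the counts and sizes taken from the description above (the volume= keyword of multivolumefile and the writestr() call are assumptions; compress.py.txt is the authoritative script):

```python
import os
import multivolumefile
import py7zr

NUM_FILES = 8000
FILE_SIZE = 400 * 1024           # 400 KB of random data per file
VOLUME_SIZE = 10 * 1024 * 1024   # split every 10 MB

# Sketch only: parameter names and the archive name are assumptions.
with multivolumefile.open('multi.7z', mode='wb', volume=VOLUME_SIZE) as volumes:
    with py7zr.SevenZipFile(volumes, mode='w') as archive:
        for i in range(NUM_FILES):
            archive.writestr(os.urandom(FILE_SIZE), f'file_{i:05d}.bin')
```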
If enough memory is available, the archive can be uncompressed without any issue. The process still uses a lot of memory (about 3.3 GB), which is not expected, since each compressed file is quite small and the decompression script immediately discards the data on the fly.
If not enough memory is available, the decompression script crashes with a CRC error (see log below) or a Bad7zFile: invalid header data exception. Actually, it seems the CRC error is only a consequence of the lack of memory, as the archive itself looks perfectly fine: 7z-crc-error.log
We can see the archive is error-free:
Note that for testing purposes, it is possible to deliberately fill the memory using commands such as:
head -c 5G /dev/zero | tail
Related issues
These issues might be related to this one, but none of the existing tickets mention multi-volume and OOM at the same time:
To Reproduce
Run the decompression script above on the provided archive, then run ps up <pid> in another terminal to see how the memory usage increases.
Expected behavior
Even if the archive has a total size of 3 GB, it is not expected that uncompressing it file by file, where each file is 400 KB, fills the memory. Uncompressing a multi-volume archive should have a very low memory footprint, as it should be possible to write the bytes directly to disk, whatever the size of the archive, the size of the individual files, the number of volumes, or the number of compressed files in the archive.
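For comparison, a low-memory loop could look like the sketch below, using py7zr's extract(targets=...) so that decompressed bytes go to disk instead of into an in-memory dict; whether the current implementation actually keeps the footprint low in this case is exactly what this issue is about.

```python
import multivolumefile
import py7zr

# Sketch only: archive and output directory names are assumptions.
with multivolumefile.open('multi.7z', mode='rb') as volumes:
    with py7zr.SevenZipFile(volumes, mode='r') as archive:
        for name in archive.getnames():
            archive.extract(path='extracted', targets=[name])  # written to disk
            archive.reset()
```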
Environment (please complete the following information):
Test data (please attach in the report):
See provided archive or script to generate it above.
Additional context