improve memory usage for archived or compressed files #182

Open
jtmoon79 opened this issue Aug 29, 2023 · 1 comment
Labels
code improvement (enhancement not seen by the user), difficult (a major coding effort or difficult algorithm to perfect), enhancement (new feature or request), P1 (important)

Comments

jtmoon79 (Owner) commented Aug 29, 2023

Summary

This is a "meta issue" to link to specific issues around a chronic issue.

Current behavior

Currently, some files are entirely read into memory. When dealing with large files and/or many files, the process can use too much memory and possibly crash due to Out Of Memory (OOM) errors.

Reading from a large number of archived files was the original motivating use-case for this project, i.e. many "tars of logs" from related systems. It's too bad this use-case can cause OOMs.

Specifically, the following circumstances read much or all of a file into memory at one time (memory use is O(n)):


XZ

.xz files must be entirely read into memory during BlockReader::new.

This is because lzma_rs::xz_decompress reads the entire file in one function call. There is no API that reads chunks of data for some requested number of bytes.
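
A minimal sketch of that constraint, assuming the lzma_rs crate as described (a hypothetical standalone function, not the BlockReader code):

```rust
use std::fs::File;
use std::io::BufReader;

// Hypothetical sketch: lzma_rs only exposes whole-stream decompression, so the
// entire decompressed file ends up in `decompressed`.
fn decompress_whole_xz(path: &str) -> std::io::Result<Vec<u8>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut decompressed: Vec<u8> = Vec::new();
    // one call decompresses the entire stream; memory use is O(uncompressed size)
    lzma_rs::xz_decompress(&mut reader, &mut decompressed)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, format!("{:?}", e)))?;
    Ok(decompressed)
}
```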

Noted in gendx/lzma-rs#110

See #12


BZ2

The entire .bz2 file is decompressed to get its uncompressed size before processing.
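
A minimal sketch of that naive pattern, assuming the bzip2 crate and a hypothetical helper (the .lz4 case below is analogous):

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical sketch: the bz2 format carries no uncompressed-size field, so
// the whole stream is decompressed, and here it is also held in memory just to
// learn its length.
fn bz2_uncompressed_size(path: &str) -> std::io::Result<u64> {
    let mut decoder = bzip2::read::BzDecoder::new(File::open(path)?);
    let mut decompressed: Vec<u8> = Vec::new();
    decoder.read_to_end(&mut decompressed)?; // O(n) memory
    Ok(decompressed.len() as u64)
}
```

A streaming byte count (for example, std::io::copy into std::io::sink) would avoid holding the data but would still pay the full decompression time.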

See #300


LZ4

The entire .lz4 file is decompressed to get its uncompressed size before processing.

See #293


GZ

Fixed in 02261be. See https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.7.73/src/readers/blockreader.rs#L2861-L2884

.gz files must be entirely read into memory up to the requested offset. This may be a problem when a user passes a datetime filter --dt-after for a large file. The searching algorithm for plain text files is a binary search, and the same search is used over the decompressed data. The search for the first datetime after the --dt-after value may require reading from disk at least half of the uncompressed file (if the first datetime in block 0 is before the passed --dt-after value). However, before that search among the decompressed data can occur, all data prior to the requested file offset must be read (and is held in memory). This is due to how gzip compresses data as a stream. In other words, you cannot decompress a block of compressed data at an arbitrary offset without first decompressing all preceding blocks of compressed data in the "gzip data stream".

If a user does not pass --dt-after then the gzip decompressed data is read from block offset 0 and, as further blocks are read, old blocks are dropped (so memory usage does not scale with the size of the file).
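
A minimal sketch of that sequential constraint, assuming the flate2 crate (a hypothetical standalone function, not the project's fixed implementation in blockreader.rs):

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical sketch: reach decompressed offset `target` in a .gz file.
fn read_gz_block_at(path: &str, target: u64, block_sz: u64) -> std::io::Result<Vec<u8>> {
    let mut decoder = flate2::read::GzDecoder::new(File::open(path)?);
    // skip `target` decompressed bytes; gzip is a stream, so they must still be
    // decompressed, but here they are discarded rather than held in memory
    std::io::copy(&mut decoder.by_ref().take(target), &mut std::io::sink())?;
    // then read just the requested block
    let mut block = Vec::with_capacity(block_sz as usize);
    decoder.take(block_sz).read_to_end(&mut block)?;
    Ok(block)
}
```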


TAR

A file contained within a .tar file must be entirely read into memory during the first call to BlockReader::read_block_FileTar, e.g. the entire contained file syslog from logs.tar, but not the entire file logs.tar.
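
A minimal sketch assuming the tar crate, with hypothetical names (not BlockReader::read_block_FileTar itself); the archive is streamed entry by entry, but the matching contained file is read into memory in full:

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical sketch: read one contained file (e.g. "syslog") out of logs.tar.
fn read_contained_file(archive: &str, name: &str) -> std::io::Result<Option<Vec<u8>>> {
    let mut ar = tar::Archive::new(File::open(archive)?);
    for entry in ar.entries()? {
        let mut entry = entry?;
        if entry.path()?.to_string_lossy() == name {
            let mut data = Vec::new();
            entry.read_to_end(&mut data)?; // entire contained file now in memory
            return Ok(Some(data));
        }
    }
    Ok(None)
}
```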

See #13


EVTX

.evtx files must be entirely read into memory due to "out of order" events.

Relates to #86


Suggested behavior

Have some O(1) ceiling on memory usage for all cases.

Relates to #14

@jtmoon79 jtmoon79 added bug Something isn't working enhancement New feature or request P1 important difficult A difficult problem; a major coding effort or difficult algorithm to perfect labels Aug 29, 2023
@jtmoon79 jtmoon79 changed the title improve memory usage for large files improve memory usage for use-case of many archived files Aug 29, 2023
@jtmoon79 jtmoon79 changed the title improve memory usage for use-case of many archived files improve memory usage for archived or compressed files Aug 30, 2023
jtmoon79 (Owner, Author) commented Aug 31, 2023

The current implementation for xz files

SyslineReader::find_sysline_at_datetime_filter binary search

The SyslogProcessor::find_sysline_between_datetime_filters calls SyslineReader::find_sysline_at_datetime_filter
which does a binary search over the entire file during the "find datetime" stage (Stage2FindDt). This stage occurs before any log messages are printed to the console (during stage Stage3StreamSyslines).
The intention here is that when a user passes a --dt-after value then the first qualifying value can be found quickly by skipping the reading and processing of unnecessary data (any syslines before the --dt-after datetime).
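
A rough sketch of what that binary search does, using hypothetical names in place of the SyslineReader machinery (datetime_of_block and the epoch-seconds datetimes are illustrative assumptions, not the project's API):

```rust
// Hypothetical sketch: find the first block whose first sysline datetime is at
// or after `dt_after` (datetimes as epoch seconds). Each probe forces the
// underlying reader to produce the block at that offset.
fn find_first_at_or_after(
    num_blocks: u64,
    dt_after: i64,
    datetime_of_block: impl Fn(u64) -> i64, // hypothetical helper
) -> u64 {
    let (mut lo, mut hi) = (0u64, num_blocks);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if datetime_of_block(mid) < dt_after {
            lo = mid + 1; // everything at or before `mid` is too early
        } else {
            hi = mid; // `mid` qualifies; look for an earlier qualifying block
        }
    }
    lo // first qualifying block offset, or `num_blocks` if none qualifies
}
```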

This was one of the original insights that motivated writing a custom BlockReader and its attendant classes; very large log files read over slow communication links could be searched very quickly for log messages within some requested period of time.

underlying BlockReader and drops

The underlying BlockReader does not know which stage is currently in progress (nor should it; stages are only a concept of the SyslogProcessor). So the BlockReader does not know when it should drop any held blocks. Nor could the BlockReader decide to drop its own blocks, because the user of the BlockReader, the LineReader, holds BlockP pointers to the Blocks held by the BlockReader. In other words, even if the BlockReader decided to be clever and drop some of the Blocks it is storing, there would still be pointers to those Blocks (an Arc<Vec<u8>>) held by the LineReader instance. And the BlockReader should not try to signal the LineReader to drop those BlockP pointers (a BlockReader should know nothing about a LineReader).
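
A minimal sketch of why dropping from the BlockReader alone is not enough; the BTreeMap store and the main function are hypothetical, while Block and BlockP mirror the Arc<Vec<u8>> described above:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type Block = Vec<u8>;
type BlockP = Arc<Block>;

fn main() {
    let mut blocks: BTreeMap<u64, BlockP> = BTreeMap::new(); // BlockReader's store
    blocks.insert(0, Arc::new(vec![0u8; 4096]));

    let held_by_linereader: BlockP = Arc::clone(&blocks[&0]); // LineReader's copy

    blocks.remove(&0); // BlockReader "drops" its entry...
    // ...but the allocation stays alive because a pointer is still outstanding.
    assert_eq!(Arc::strong_count(&held_by_linereader), 1);
}
```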

Signaling the BlockReader to drop data is currently initiated by the driver function in bin.rs, exec_syslogprocessor.
The calls to SyslogProcessor::drop_data_try are made during the log printing stage Stage3StreamSyslines.

review of current implementation

To review: during the binary search of find_sysline_at_datetime_filter, done early on during stage Stage2FindDt (again, only needed if the user passes --dt-after), there are no attempts to tell the BlockReader to drop old data.
And during this stage, it is possible for an entire .xz file or an entire .gz file to be held in memory if, for example, the user passes a --dt-after datetime value that is after all log messages in the decompressed log data. Or, if a user passes a --dt-after datetime value that is after the first sysline of the file, then the binary search of find_sysline_at_datetime_filter will jump to the middle of the file.

Idea: BlockReader only holds N blocks at a time

So the next idea would be to modify the BlockReader so that it only ever holds some N number of Blocks at a time.

Problem

But that's difficult because, remember, the controlling LineReader holds its own copies of pointers to those Blocks as well (BlockP instances), so the underlying Vec<u8> wouldn't drop while there are outstanding pointers to it. Additionally, it would be a bad design to have the BlockReader call back to the LineReader (now the BlockReader needs to understand what a LineReader is, and that gets too messy).

Idea: BlockReader proactively drops blocks when handling xz files

Another idea is to modify BlockReader::read_block_FileXz so that it proactively drops blocks that are not needed. Function read_block_FileXz does a linear search, storing Blocks as they are decompressed, and passes back a single BlockP for the requested Block at a BlockOffset. So the BlockReader could quietly drop "old" Blocks of data as new Blocks are decompressed during the read_block_FileXz function.

For example, given a call to read_block_FileXz(9) to read the tenth Block of data, that requires decompressing all prior Blocks of data (the xz algorithm can only linearly stream compression/decompression). The function read_block_FileXz could be modified such that after Block two is read then Block one is dropped, and so on, up to the tenth Block (so the prior nine decompressed Blocks would have been dropped).
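
A minimal sketch of such a proactive drop, using a hypothetical helper and a hypothetical BTreeMap store (not read_block_FileXz itself):

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type BlockOffset = u64;
type BlockP = Arc<Vec<u8>>;

// Hypothetical helper: after decompressing block `newest`, keep only the most
// recent `n` blocks; anything older is dropped from the reader's store
// (though, per the note above, a LineReader holding a BlockP keeps it alive).
fn retain_recent(blocks: &mut BTreeMap<BlockOffset, BlockP>, newest: BlockOffset, n: u64) {
    let cutoff = newest.saturating_sub(n);
    blocks.retain(|&offset, _| offset >= cutoff);
}
```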

Problem

But then the problem is that another call from the overlying SyslineProcessor, within the binary search function find_sysline_at_datetime_filter, may later call read_block_FileXz(5). If Blocks one through six were dropped previously, then the file must be read and decompressed again from the beginning (reading the sixth Block requires reading the prior five Blocks). So the initial binary search with O(log(n)) time now becomes O(n*log(n)) time.

Idea: .xz files are only searched linearly

If the binary search during stage Stage2FindDt within SyslineProcessor::find_sysline_at_datetime_filter is instead a linear search, then the underlying BlockReader could be modified to only keep N Blocks at a time.
This would satisfy the .xz decompression need for streaming the data (reading all preceding data linearly).
The same could be done for .gz files.
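
A rough sketch of that linear approach, reusing the hypothetical names from the binary-search sketch above:

```rust
// Hypothetical sketch: scan forward for the first block at or after `dt_after`,
// telling the reader it may drop everything behind the scan position so only a
// bounded window of blocks is ever held.
fn find_linear(
    num_blocks: u64,
    dt_after: i64,
    datetime_of_block: impl Fn(u64) -> i64, // hypothetical helper
    mut drop_blocks_before: impl FnMut(u64), // hypothetical: reader drops older blocks
) -> Option<u64> {
    for offset in 0..num_blocks {
        if datetime_of_block(offset) >= dt_after {
            return Some(offset);
        }
        // blocks before `offset` will never be revisited in a linear scan
        drop_blocks_before(offset);
    }
    None
}
```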

jtmoon79 added a commit that referenced this issue Sep 5, 2023
Always read gz files sequentially. Do not do a binary search
within SyslineReader::find_sysline_at_datetime_filter for compressed files.

Within BlockReader::read_block_Gz, proactively drop prior read blocks. The
"high water" mark for Blocks held in memory at one time will be a constant
value instead of scaling to the size of the file.

Issue #182
jtmoon79 added a commit that referenced this issue Jun 23, 2024
Read all blocks from a .tar file. This reverts to prior behavior.

Issue #13
Issue #182
@jtmoon79 jtmoon79 added code improvement enhancement not seen by the user and removed bug Something isn't working labels Jun 24, 2024