improve memory usage for archived or compressed files #182

Open
jtmoon79 opened this issue Aug 29, 2023 · 1 comment
Labels
code improvement (enhancement not seen by the user), difficult (a major coding effort or difficult algorithm to perfect), enhancement (new feature or request), P1 (important)

Comments

jtmoon79 (Owner) commented Aug 29, 2023

Summary

This is a "meta issue" to link to specific issues around a chronic issue.

Current behavior

Currently, some files are entirely read into memory. When dealing with large files and/or many files, the process can use too much memory and possibly crash due to Out Of Memory (OOM) errors.

Reading from a large number of archived files was the original motivating use-case for this project, i.e. many "tars of logs" from related systems. It's too bad this use-case can cause OOMs.

Specifically, the following circumstances read much or all of a file into memory at one time (memory use is O(n)):


XZ

.xz files must be entirely read into memory during BlockReader::new.

This is because lzma_rs::xz_decompress reads the entire file in one function call. There is no API that reads chunks of data for some requested number of bytes.
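
A minimal sketch of that constraint, assuming the lzma_rs crate as described (a hypothetical standalone function, not the BlockReader code):

```rust
use std::fs::File;
use std::io::BufReader;

// Hypothetical sketch: lzma_rs only exposes whole-stream decompression, so the
// entire decompressed file ends up in `decompressed`.
fn decompress_whole_xz(path: &str) -> std::io::Result<Vec<u8>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut decompressed: Vec<u8> = Vec::new();
    // one call decompresses the entire stream; memory use is O(uncompressed size)
    lzma_rs::xz_decompress(&mut reader, &mut decompressed)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, format!("{:?}", e)))?;
    Ok(decompressed)
}
```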

Noted in gendx/lzma-rs#110

See #12


BZ2

The entire .bz2 file is decompressed to get its uncompressed size before processing.
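
A minimal sketch of that naive pattern, assuming the bzip2 crate and a hypothetical helper (the .lz4 case below is analogous):

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical sketch: the bz2 format carries no uncompressed-size field, so
// the whole stream is decompressed, and here it is also held in memory just to
// learn its length.
fn bz2_uncompressed_size(path: &str) -> std::io::Result<u64> {
    let mut decoder = bzip2::read::BzDecoder::new(File::open(path)?);
    let mut decompressed: Vec<u8> = Vec::new();
    decoder.read_to_end(&mut decompressed)?; // O(n) memory
    Ok(decompressed.len() as u64)
}
```

A streaming byte count (for example, std::io::copy into std::io::sink) would avoid holding the data but would still pay the full decompression time.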

See #300


LZ4

The entire .lz4 file is decompressed to get its uncompressed size before processing.

See #293


GZ

Fixed in 02261be. See https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.7.73/src/readers/blockreader.rs#L2861-L2884

.gz files must be entirely read into memory up to the requested offset. This may be a problem when a user passes a datetime filter --dt-after for a large file. The searching algorithm for plain text files is a binary search, and the same search is used over the decompressed data. The search for the first datetime after the --dt-after value may require reading from disk at least half of the uncompressed file (if the first datetime in block 0 is before the passed --dt-after value). However, before that search among the decompressed data can occur, all data prior to the requested file offset must be read (and is held in memory). This is due to how gzip compresses data as a stream. In other words, you cannot decompress a block of compressed data at an arbitrary offset without first decompressing all preceding blocks of compressed data in the "gzip data stream".

If a user does not pass --dt-after then the gzip decompressed data is read from block offset 0 and, as further blocks are read, old blocks are dropped (so memory usage does not scale with the size of the file).
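
A minimal sketch of that sequential constraint, assuming the flate2 crate (a hypothetical standalone function, not the project's fixed implementation in blockreader.rs):

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical sketch: reach decompressed offset `target` in a .gz file.
fn read_gz_block_at(path: &str, target: u64, block_sz: u64) -> std::io::Result<Vec<u8>> {
    let mut decoder = flate2::read::GzDecoder::new(File::open(path)?);
    // skip `target` decompressed bytes; gzip is a stream, so they must still be
    // decompressed, but here they are discarded rather than held in memory
    std::io::copy(&mut decoder.by_ref().take(target), &mut std::io::sink())?;
    // then read just the requested block
    let mut block = Vec::with_capacity(block_sz as usize);
    decoder.take(block_sz).read_to_end(&mut block)?;
    Ok(block)
}
```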


TAR

A file contained within a .tar file must be entirely read into memory during the first call to BlockReader::read_block_FileTar, e.g. the entire contained file syslog from logs.tar, but not the entire file logs.tar.
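
A minimal sketch assuming the tar crate, with hypothetical names (not BlockReader::read_block_FileTar itself); the archive is streamed entry by entry, but the matching contained file is read into memory in full:

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical sketch: read one contained file (e.g. "syslog") out of logs.tar.
fn read_contained_file(archive: &str, name: &str) -> std::io::Result<Option<Vec<u8>>> {
    let mut ar = tar::Archive::new(File::open(archive)?);
    for entry in ar.entries()? {
        let mut entry = entry?;
        if entry.path()?.to_string_lossy() == name {
            let mut data = Vec::new();
            entry.read_to_end(&mut data)?; // entire contained file now in memory
            return Ok(Some(data));
        }
    }
    Ok(None)
}
```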

See #13


EVTX

.evtx files must be entirely read into memory due to "out of order" events.

Relates to #86


Suggested behavior

Have some O(1) ceiling on memory usage for all cases.

Relates to #14

@jtmoon79 jtmoon79 added bug Something isn't working enhancement New feature or request P1 important difficult A difficult problem; a major coding effort or difficult algorithm to perfect labels Aug 29, 2023
@jtmoon79 jtmoon79 changed the title improve memory usage for large files improve memory usage for use-case of many archived files Aug 29, 2023
@jtmoon79 jtmoon79 changed the title improve memory usage for use-case of many archived files improve memory usage for archived or compressed files Aug 30, 2023
jtmoon79 (Owner, Author) commented Aug 31, 2023

The current implementation for xz files

SyslineReader::find_sysline_at_datetime_filter binary search

The SyslogProcessor::find_sysline_between_datetime_filters calls SyslineReader::find_sysline_at_datetime_filter
which does a binary search over the entire file during the "find datetime" stage (Stage2FindDt). This stage occurs before any log messages are printed to the console (during stage Stage3StreamSyslines).
The intention here is that when a user passes a --dt-after value then the first qualifying value can be found quickly by skipping the reading and processing of unnecessary data (any syslines before the --dt-after datetime).
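
A rough sketch of what that binary search does, using hypothetical names in place of the SyslineReader machinery (datetime_of_block and the epoch-seconds datetimes are illustrative assumptions, not the project's API):

```rust
// Hypothetical sketch: find the first block whose first sysline datetime is at
// or after `dt_after` (datetimes as epoch seconds). Each probe forces the
// underlying reader to produce the block at that offset.
fn find_first_at_or_after(
    num_blocks: u64,
    dt_after: i64,
    datetime_of_block: impl Fn(u64) -> i64, // hypothetical helper
) -> u64 {
    let (mut lo, mut hi) = (0u64, num_blocks);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if datetime_of_block(mid) < dt_after {
            lo = mid + 1; // everything at or before `mid` is too early
        } else {
            hi = mid; // `mid` qualifies; look for an earlier qualifying block
        }
    }
    lo // first qualifying block offset, or `num_blocks` if none qualifies
}
```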

This was one of the original insights that motivated writing a custom BlockReader and its attendant classes; very large log files read over slow communication links could be searched very quickly for log messages within some requested period of time.

underlying BlockReader and drops

The underlying BlockReader does not know which stage is currently in progress (nor should it; stages are only a concept of the SyslogProcessor). So the BlockReader does not know when it should drop any held blocks. Nor could the BlockReader decide to drop its own blocks, because the user of the BlockReader, the LineReader, holds BlockP pointers to the Blocks held by the BlockReader. In other words, even if the BlockReader decided to be clever and drop some of the Blocks it is storing, there would still be pointers to those Blocks (an Arc<Vec<u8>>) held by the LineReader instance. And the BlockReader should not try to signal the LineReader to drop those BlockP pointers (a BlockReader should know nothing about a LineReader).
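
A minimal sketch of why dropping from the BlockReader alone is not enough; the BTreeMap store and the main function are hypothetical, while Block and BlockP mirror the Arc<Vec<u8>> described above:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type Block = Vec<u8>;
type BlockP = Arc<Block>;

fn main() {
    let mut blocks: BTreeMap<u64, BlockP> = BTreeMap::new(); // BlockReader's store
    blocks.insert(0, Arc::new(vec![0u8; 4096]));

    let held_by_linereader: BlockP = Arc::clone(&blocks[&0]); // LineReader's copy

    blocks.remove(&0); // BlockReader "drops" its entry...
    // ...but the allocation stays alive because a pointer is still outstanding.
    assert_eq!(Arc::strong_count(&held_by_linereader), 1);
}
```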

Signaling the BlockReader to drop data is currently initiated by the driver function in bin.rs, exec_syslogprocessor.
The calls to SyslogProcessor::drop_data_try are made during the log printing stage Stage3StreamSyslines.

review of current implementation

To review: during the binary search of find_sysline_at_datetime_filter, done early on during stage Stage2FindDt (again, only needed if the user passes --dt-after), there are no attempts to tell the BlockReader to drop old data.
And during this stage, it is possible for an entire .xz file or an entire .gz file to be held in memory if, for example, the user passes a --dt-after datetime value that is after all log messages in the decompressed log data. Or, if a user passes a --dt-after datetime value that is after the first sysline of the file, then the binary search of find_sysline_at_datetime_filter will jump to the middle of the file.

Idea: BlockReader only holds N blocks at a time

So the next idea would be to modify the BlockReader so that it only ever holds some N number of Blocks at a time.

Problem

But that's difficult because, remember, the controlling LineReader holds its own copies of pointers to those Blocks as well (BlockP instances), so the underlying Vec<u8> wouldn't drop while there are outstanding pointers to it. Additionally, it would be a bad design to have the BlockReader call back to the LineReader (now the BlockReader needs to understand what a LineReader is, and that gets too messy).

Idea: BlockReader proactively drops blocks when handling xz files

Another idea is to modify BlockReader::read_block_FileXz so that it proactively drops blocks that are not needed. Function read_block_FileXz does a linear search, storing Blocks as they are decompressed, and passes back a single BlockP for the requested Block at a BlockOffset. So the BlockReader could quietly drop "old" Blocks of data as new Blocks are decompressed during the read_block_FileXz function.

For example, given a call to read_block_FileXz(9) to read the tenth Block of data, that requires decompressing all prior Blocks of data (the xz algorithm can only linearly stream compression/decompression). The function read_block_FileXz could be modified such that after Block two is read then Block one is dropped, and so on, up to the tenth Block (so the prior nine decompressed Blocks would have been dropped).
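
A minimal sketch of such a proactive drop, using a hypothetical helper and a hypothetical BTreeMap store (not read_block_FileXz itself):

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type BlockOffset = u64;
type BlockP = Arc<Vec<u8>>;

// Hypothetical helper: after decompressing block `newest`, keep only the most
// recent `n` blocks; anything older is dropped from the reader's store
// (though, per the note above, a LineReader holding a BlockP keeps it alive).
fn retain_recent(blocks: &mut BTreeMap<BlockOffset, BlockP>, newest: BlockOffset, n: u64) {
    let cutoff = newest.saturating_sub(n);
    blocks.retain(|&offset, _| offset >= cutoff);
}
```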

Problem

But then the problem is that another call from the overlying SyslineProcessor, within the binary search function find_sysline_at_datetime_filter, may later call read_block_FileXz(5). If Blocks one through six were dropped previously, then the file must be read and decompressed again from the beginning (reading the sixth Block requires reading the prior five Blocks). So the initial binary search with O(log(n)) time now becomes O(n*log(n)) time.

Idea: .xz files are only searched linearly

If the binary search during stage Stage2FindDt within SyslineProcessor::find_sysline_at_datetime_filter is instead a linear search, then the underlying BlockReader could be modified to only keep N Blocks at a time.
This would satisfy the .xz decompression need for streaming the data (reading all preceding data linearly).
The same could be done for .gz files.
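
A rough sketch of that linear approach, reusing the hypothetical names from the binary-search sketch above:

```rust
// Hypothetical sketch: scan forward for the first block at or after `dt_after`,
// telling the reader it may drop everything behind the scan position so only a
// bounded window of blocks is ever held.
fn find_linear(
    num_blocks: u64,
    dt_after: i64,
    datetime_of_block: impl Fn(u64) -> i64, // hypothetical helper
    mut drop_blocks_before: impl FnMut(u64), // hypothetical: reader drops older blocks
) -> Option<u64> {
    for offset in 0..num_blocks {
        if datetime_of_block(offset) >= dt_after {
            return Some(offset);
        }
        // blocks before `offset` will never be revisited in a linear scan
        drop_blocks_before(offset);
    }
    None
}
```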

jtmoon79 added a commit that referenced this issue Sep 5, 2023
Always read gz files sequentially. Do not do a binary search
within SyslineReader::find_sysline_at_datetime_filter for compressed files.

Within BlockReader::read_block_Gz, proactively drop prior read blocks. The
"high water" mark for Blocks held in memory at one time will be a constant
value instead of scaling to the size of the file.

Issue #182
jtmoon79 added a commit that referenced this issue Jun 23, 2024
Read all blocks from a .tar file. This reverts to prior behavior.

Issue #13
Issue #182
@jtmoon79 jtmoon79 added code improvement enhancement not seen by the user and removed bug Something isn't working labels Jun 24, 2024