improve memory usage for archived or compressed files #182
The current implementation for xz files: (embedded code snippet)
Always read gz files sequentially. Do not do a binary search within `SyslineReader::find_sysline_at_datetime_filter` for compressed files. Within `BlockReader::read_block_Gz`, proactively drop prior read blocks. The "high water" mark for Blocks held in memory at one time will be a constant value instead of scaling to the size of the file. Issue #182
Summary
This is a "meta issue" linking to specific issues around a chronic problem.
Current behavior
Currently, some files are entirely read into memory. When dealing with large files and/or many files, the process can use too much memory and possibly crash due to Out Of Memory (OOM) errors.
Reading from a large number of archived files was the original motivating use-case for this project, i.e. many "tars of logs" from related systems. It's too bad this use-case can cause OOMs.
Specifically, these circumstances read much or all of a file into memory at one time (memory use is O(n) in the file size):
XZ

`.xz` files must be entirely read into memory during `BlockReader::new`. This is due to `lzma_rs::xz_decompress` reading the entire file in one function call; there is no API that reads chunks of data for some requested amount of bytes. Noted in gendx/lzma-rs#110

See #12
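For illustration, a minimal sketch (not the project's code) of why memory scales with the uncompressed size: `lzma_rs::xz_decompress` takes a reader and a writer and produces all decompressed output in one call, so the simplest usage collects everything into a single buffer.

```rust
// Minimal sketch: decompressing an .xz file with lzma_rs.
// There is no chunked/streaming-pull API, so the entire decompressed
// content is collected into one buffer in a single call.
use std::fs::File;
use std::io::BufReader;

fn read_xz_whole(path: &str) -> std::io::Result<Vec<u8>> {
    let mut reader = BufReader::new(File::open(path)?);
    // The whole decompressed content lands in `decompressed` at once;
    // memory use scales with the uncompressed file size.
    let mut decompressed: Vec<u8> = Vec::new();
    lzma_rs::xz_decompress(&mut reader, &mut decompressed)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, format!("{:?}", e)))?;
    Ok(decompressed)
}
```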
BZ2

The entire `.bz2` file is uncompressed to get its uncompressed size before processing. See #300
LZ4

The entire `.lz4` file is uncompressed to get its uncompressed size before processing. See #293
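As a hedged sketch of an alternative (assuming the `bzip2` crate; the project may use a different one), the uncompressed size could be obtained by streaming the decoder into a sink so no decompressed data is retained. The same pattern would apply to `.lz4` with an LZ4 frame decoder.

```rust
// Sketch: determine the uncompressed size of a .bz2 file by streaming the
// decoder into a sink. The file is still fully decompressed (CPU/IO cost),
// but no decompressed data is held in memory.
use std::fs::File;
use std::io::{self, BufReader};

use bzip2::read::BzDecoder; // assumes the `bzip2` crate

fn bz2_uncompressed_size(path: &str) -> io::Result<u64> {
    let file = BufReader::new(File::open(path)?);
    let mut decoder = BzDecoder::new(file);
    // `io::sink()` discards the bytes; `io::copy` returns how many were written.
    io::copy(&mut decoder, &mut io::sink())
}
```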
GZ

Fixed in 02261be. See https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.7.73/src/readers/blockreader.rs#L2861-L2884

`.gz` files must be entirely read into memory up to the requested offset. This may be a problem when a user passes a datetime filter `--dt-after` for a large file. The searching algorithm for plain text files is a binary search, and the same algorithm is used for the decompressed data. The search for the first datetime after that `--dt-after` value may require reading at least half of the uncompressed file from disk (if the first datetime in block 0 is before the passed `--dt-after` value). However, before that search among the decompressed data can occur, all data prior to the requested file offset must be read (and is held in memory). This is due to how gzip compresses data as a stream: you cannot decompress a block of compressed data at an arbitrary offset without first decompressing all preceding blocks of compressed data in the "gzip data stream".

If a user does not pass `--dt-after` then gzip decompressed data is read from block offset 0 and, as further blocks are read, old blocks are dropped (so memory usage does not scale with the size of the file).
TAR

A file contained in a `.tar` file must be entirely read into memory during the first call to `BlockReader::read_block_FileTar`, e.g. the entire file `syslog` from `logs.tar`, but not the entire file `logs.tar`. See #13
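A hedged sketch (assuming the `tar` crate; names are illustrative, not the project's code) of reading a contained file in fixed-size chunks instead of loading the whole entry at once:

```rust
// Sketch: read one contained file (e.g. `syslog`) from `logs.tar`
// in fixed-size chunks rather than loading the whole entry into memory.
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

fn read_tar_entry_chunked(tar_path: &str, wanted: &str) -> io::Result<u64> {
    let mut archive = tar::Archive::new(File::open(tar_path)?);
    let mut total: u64 = 0;
    for entry in archive.entries()? {
        let mut entry = entry?;
        if entry.path()?.as_ref() == Path::new(wanted) {
            let mut buf = [0u8; 16 * 1024];
            loop {
                let n = entry.read(&mut buf)?;
                if n == 0 {
                    break;
                }
                total += n as u64; // process `buf[..n]` here, then let it be reused
            }
        }
    }
    Ok(total)
}
```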
EVTX

`.evtx` files must be entirely read into memory due to "out of order" events. Relates to #86
Suggested behavior
Have some O(1) ceiling on memory usage for all cases.
Relates to #14
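One way to picture an O(1) ceiling, using hypothetical types rather than the project's `BlockReader` internals: a block cache with a fixed capacity that drops the oldest block when full, so the "high water" memory mark stays constant regardless of file size.

```rust
// Illustrative sketch (hypothetical types): a block cache with a fixed
// capacity, so the "high water" memory mark is O(1) regardless of file size.
use std::collections::VecDeque;

type BlockOffset = u64;
type Block = Vec<u8>;

struct BoundedBlockCache {
    capacity: usize,                        // maximum blocks held at once
    blocks: VecDeque<(BlockOffset, Block)>, // oldest block at the front
}

impl BoundedBlockCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, blocks: VecDeque::with_capacity(capacity) }
    }

    fn insert(&mut self, offset: BlockOffset, block: Block) {
        if self.blocks.len() == self.capacity {
            self.blocks.pop_front(); // drop the oldest block: memory ceiling is O(1)
        }
        self.blocks.push_back((offset, block));
    }

    fn get(&self, offset: BlockOffset) -> Option<&Block> {
        self.blocks.iter().find(|(o, _)| *o == offset).map(|(_, b)| b)
    }
}
```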