refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

jtmoon79 · 2024-04-20T20:16:40Z

Current Behavior

This project's original design used a binary search algorithm to quickly find log messages within a user-passed datetime window.
The main benefit of this complication was to avoid unnecessary disk reads. i.e. instead of reading the entire file from disk, only a subset of Blocks were read from disk (defaulting to 64KB-sized Blocks). This supported the author's originating scenario where very large log files (GB sized) were read over a low-bandwidth high-latency network connection from an overburdened SMB share.
This binary search design affects structs SyslogProcessor, SyslineReader, and BlockReader.

However...

Having random file seeks (seeks that may go backwards) does not dovetail with compressed files; compressed files must always be read from the beginning of the file up to the requested file offset. (see #12 (comment)).

This creates a problem when performing a binary search: any "jumps backwards" must either

A. read cached data Blocks, or
B. re-read the file starting from the beginning

and

A. implies that nearly the entire file may be kept in memory at one time (a problem when dealing with, for example, multiple GB-sized log files, which was the situation in the author's originating scenario; see Issue improve memory usage for archived or compressed files #182)
B. implies a O(log₂(n) * n) read time instead of a O(log₂(n)) (binary search) or O(n) (linear search), (n being the size of the file)

Suggested behavior

Some entity should decide if a file can be read with "random seeks" or "sequential seeks".
The overlying users of BlockReader, (SyslogProcessor and it's SyslineReader) must know the datetime search strategy to use. So some entity must decide on the "search mode" ahead of time, most likely a function in filepreprocessor.rs.

Then those entities that do the search for the datetime nearest the beginning of the user-passed datetime window, those entities would perform the search strategy that is appropriate, i.e. "sequential read mode" implies a linear search from the file beginning, "random read mode" implies a binary search.

Other

Given a sequential read mode, this is one step toward fixing the problem highlighted by Issue #12, as the file size of the uncompressed file would not need to be known. If the file size does not need to be known then this is one step toward implementing Chained Block Reads (Issue #14).

Also, reading in a "forward only" linear read mode has an incremental approach will be necessary for Issue #7.

Also, a sequential/linear read mode means Blocks can be progressively dropped as the file is processed. This would improve the problem of high memory usage for compressed files (Issue #182).

This search strategy difference is also hinted-at in various code comments; syslogprocessor.rs#L822-L828, syslinereader.rs#L2474-L2520.

The text was updated successfully, but these errors were encountered:

Issue #283

Preparation for Issue #283

support parsing lz4 compressed files `.lz4` refactor processing sequential-only syslogs (compressed logs that can only be read linearly; no binary searching) add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those add tests in compare-current-and-expected add lz4 flamegraph in flamegraphs.sh Fix Issue #201 with tests that had unknown panics for blockreader processing gzip files. Issue PSeitz/lz4_flex#159 Issue #201 Issue #128 Issue #291 Issue #283

support parsing lz4 compressed files `.lz4` refactor processing sequential-only syslogs (compressed logs that can only be read linearly; no binary searching) add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those add tests in compare-current-and-expected add lz4 flamegraph in flamegraphs.sh Fix Issue #201 with tests that had unknown panics for blockreader processing gzip files. README.md mention lz4 and update comparison table Issue PSeitz/lz4_flex#159 Issue #201 Issue #128 Issue #291 Issue #283

Issue #283 Issue #291 Issue #285

jtmoon79 added code improvement enhancement not seen by the user difficult A difficult problem; a major coding effort or difficult algorithm to perfect file parser labels Apr 20, 2024

jtmoon79 changed the title ~~refactor datetime searching and general file processing to support "forward seek" mode or "random seek" mode~~ refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode Apr 20, 2024

This was referenced Apr 20, 2024

refactor FileType so compressed and archived file handling is transparent #285

Closed

read .xz file by requested block #12

Open

jtmoon79 added a commit that referenced this issue May 6, 2024

blockreader.rs add read_block_last

6d89273

Issue #283

jtmoon79 added a commit that referenced this issue May 6, 2024

syslinereader.rs add file_offset_last

f5d9f89

Preparation for Issue #283

jtmoon79 added a commit that referenced this issue May 29, 2024

README.md note lz4, journal evtx temp file, other updates

458dedd

Issue #283 Issue #291 Issue #285

jtmoon79 closed this as completed May 31, 2024

jtmoon79 mentioned this issue May 31, 2024

read tar file by requested block #13

Open

jtmoon79 mentioned this issue Jun 17, 2024

large compressed files with datetime pattern without year, fail to fully process #305

Closed

jtmoon79 mentioned this issue Jul 16, 2024

improve memory usage for archived or compressed files #182

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

jtmoon79 commented Apr 20, 2024 •

edited

Loading

refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

Comments

jtmoon79 commented Apr 20, 2024 • edited Loading

Current Behavior

Suggested behavior

Other

jtmoon79 commented Apr 20, 2024 •

edited

Loading