Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

Closed
jtmoon79 opened this issue Apr 20, 2024 · 0 comments
Labels
code improvement enhancement not seen by the user difficult A difficult problem; a major coding effort or difficult algorithm to perfect file parser

Comments

@jtmoon79
Copy link
Owner

jtmoon79 commented Apr 20, 2024

Current Behavior

This project's original design used a binary search algorithm to quickly find log messages within a user-passed datetime window.
The main benefit of this complication was to avoid unnecessary disk reads. i.e. instead of reading the entire file from disk, only a subset of Blocks were read from disk (defaulting to 64KB-sized Blocks). This supported the author's originating scenario where very large log files (GB sized) were read over a low-bandwidth high-latency network connection from an overburdened SMB share.
This binary search design affects structs SyslogProcessor, SyslineReader, and BlockReader.

However...

Having random file seeks (seeks that may go backwards) does not dovetail with compressed files; compressed files must always be read from the beginning of the file up to the requested file offset. (see #12 (comment)).

This creates a problem when performing a binary search: any "jumps backwards" must either

  • A. read cached data Blocks, or
  • B. re-read the file starting from the beginning

and

  • A. implies that nearly the entire file may be kept in memory at one time (a problem when dealing with, for example, multiple GB-sized log files, which was the situation in the author's originating scenario; see Issue improve memory usage for archived or compressed files #182)
  • B. implies a O(log2(n) * n) read time instead of a O(log2(n)) (binary search) or O(n) (linear search), (n being the size of the file)

Suggested behavior

Some entity should decide if a file can be read with "random seeks" or "sequential seeks".
The overlying users of BlockReader, (SyslogProcessor and it's SyslineReader) must know the datetime search strategy to use. So some entity must decide on the "search mode" ahead of time, most likely a function in filepreprocessor.rs.

Then those entities that do the search for the datetime nearest the beginning of the user-passed datetime window, those entities would perform the search strategy that is appropriate, i.e. "sequential read mode" implies a linear search from the file beginning, "random read mode" implies a binary search.

Other

Given a sequential read mode, this is one step toward fixing the problem highlighted by Issue #12, as the file size of the uncompressed file would not need to be known. If the file size does not need to be known then this is one step toward implementing Chained Block Reads (Issue #14).

Also, reading in a "forward only" linear read mode has an incremental approach will be necessary for Issue #7.

Also, a sequential/linear read mode means Blocks can be progressively dropped as the file is processed. This would improve the problem of high memory usage for compressed files (Issue #182).

This search strategy difference is also hinted-at in various code comments; syslogprocessor.rs#L822-L828, syslinereader.rs#L2474-L2520.

@jtmoon79 jtmoon79 added code improvement enhancement not seen by the user difficult A difficult problem; a major coding effort or difficult algorithm to perfect file parser labels Apr 20, 2024
@jtmoon79 jtmoon79 changed the title refactor datetime searching and general file processing to support "forward seek" mode or "random seek" mode refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode Apr 20, 2024
jtmoon79 added a commit that referenced this issue May 6, 2024
jtmoon79 added a commit that referenced this issue May 6, 2024
jtmoon79 added a commit that referenced this issue May 6, 2024
support parsing lz4 compressed files `.lz4`

refactor processing sequential-only syslogs (compressed logs that can only
be read linearly; no binary searching)

add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those

add tests in compare-current-and-expected

add lz4 flamegraph in flamegraphs.sh

Fix Issue #201 with tests that had unknown panics for blockreader
processing gzip files.

Issue PSeitz/lz4_flex#159
Issue #201
Issue #128
Issue #291
Issue #283
jtmoon79 added a commit that referenced this issue May 6, 2024
support parsing lz4 compressed files `.lz4`

refactor processing sequential-only syslogs (compressed logs that can only
be read linearly; no binary searching)

add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those

add tests in compare-current-and-expected

add lz4 flamegraph in flamegraphs.sh

Fix Issue #201 with tests that had unknown panics for blockreader
processing gzip files.

Issue PSeitz/lz4_flex#159
Issue #201
Issue #128
Issue #291
Issue #283
jtmoon79 added a commit that referenced this issue May 7, 2024
support parsing lz4 compressed files `.lz4`

refactor processing sequential-only syslogs (compressed logs that can only
be read linearly; no binary searching)

add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those

add tests in compare-current-and-expected

add lz4 flamegraph in flamegraphs.sh

Fix Issue #201 with tests that had unknown panics for blockreader
processing gzip files.

Issue PSeitz/lz4_flex#159
Issue #201
Issue #128
Issue #291
Issue #283
jtmoon79 added a commit that referenced this issue May 7, 2024
support parsing lz4 compressed files `.lz4`

refactor processing sequential-only syslogs (compressed logs that can only
be read linearly; no binary searching)

add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those

add tests in compare-current-and-expected

add lz4 flamegraph in flamegraphs.sh

Fix Issue #201 with tests that had unknown panics for blockreader
processing gzip files.

Issue PSeitz/lz4_flex#159
Issue #201
Issue #128
Issue #291
Issue #283
jtmoon79 added a commit that referenced this issue May 7, 2024
support parsing lz4 compressed files `.lz4`

refactor processing sequential-only syslogs (compressed logs that can only
be read linearly; no binary searching)

add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those

add tests in compare-current-and-expected

add lz4 flamegraph in flamegraphs.sh

Fix Issue #201 with tests that had unknown panics for blockreader
processing gzip files.

README.md mention lz4 and update comparison table

Issue PSeitz/lz4_flex#159
Issue #201
Issue #128
Issue #291
Issue #283
jtmoon79 added a commit that referenced this issue May 7, 2024
support parsing lz4 compressed files `.lz4`

refactor processing sequential-only syslogs (compressed logs that can only
be read linearly; no binary searching)

add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those

add tests in compare-current-and-expected

add lz4 flamegraph in flamegraphs.sh

Fix Issue #201 with tests that had unknown panics for blockreader
processing gzip files.

README.md mention lz4 and update comparison table

Issue PSeitz/lz4_flex#159
Issue #201
Issue #128
Issue #291
Issue #283
jtmoon79 added a commit that referenced this issue May 7, 2024
support parsing lz4 compressed files `.lz4`

refactor processing sequential-only syslogs (compressed logs that can only
be read linearly; no binary searching)

add logs for LZ4 `.lz4`, and old LZMA `.lz`, and variations of those

add tests in compare-current-and-expected

add lz4 flamegraph in flamegraphs.sh

Fix Issue #201 with tests that had unknown panics for blockreader
processing gzip files.

README.md mention lz4 and update comparison table

Issue PSeitz/lz4_flex#159
Issue #201
Issue #128
Issue #291
Issue #283
jtmoon79 added a commit that referenced this issue May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code improvement enhancement not seen by the user difficult A difficult problem; a major coding effort or difficult algorithm to perfect file parser
Projects
None yet
Development

No branches or pull requests

1 participant