refactor FileType so compressed and archived file handling is transparent #285

Closed
jtmoon79 opened this issue Apr 20, 2024 · 0 comments
Labels
difficult A difficult problem; a major coding effort or difficult algorithm to perfect enhancement New feature or request P1 important

jtmoon79 commented Apr 20, 2024

Summary

Currently, enum FileType's notion of compressed and archived files presumes them to be plain-text ad-hoc log files ("syslog" files).

Refactor enum FileType so that its variants can signify that the other file types (FixedStruct, Journal, etc.) may also be compressed or archived.

Current behavior

Currently, a compressed file such as log.gz is given FileType::Gz and is presumed to be an ad-hoc text log ("syslog" file). As a result, a compressed fixed struct file such as wtmp.gz is treated as a compressed text log file, and parsing fails.

pub enum FileType {
    Unset,
    File,
    Gz,
    Tar,
    TarGz,
    Xz,
    FixedStruct { type_: FixedStructFileType },
    Evtx,
    Journal,
    Unparsable,
    Unknown,
}

Suggested behavior

enum FileType and BlockReader should be refactored to more transparently handle files that are compressed or archived. Compression and archival should be generic sub-variants for primary FileType files.

pub enum FileType {
    Unset,
    Text { archival_type: FileArchivalType },
    FixedStruct { archival_type: FileArchivalType, type_: FixedStructFileType },
    Evtx { archival_type: FileArchivalType },
    Journal { archival_type: FileArchivalType },
    Unparsable,
    Unknown,
}

pub enum FileArchivalType {
    Normal,
    Gz,
    Xz,
    Tar,
}
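
To illustrate how the proposed variants would separate "what the file is" from "how it is stored", here is a minimal sketch. The function classify, its extension and file-name rules, and the FixedStructFileType::Utmp variant are assumptions made for this example only; they are not the project's actual detection logic.

```rust
use std::path::Path;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FileArchivalType {
    Normal,
    Gz,
    Xz,
    Tar,
}

// Stand-in for the project's `FixedStructFileType`; this variant is assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FixedStructFileType {
    Utmp,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FileType {
    Unset,
    Text { archival_type: FileArchivalType },
    FixedStruct { archival_type: FileArchivalType, type_: FixedStructFileType },
    Evtx { archival_type: FileArchivalType },
    Journal { archival_type: FileArchivalType },
    Unparsable,
    Unknown,
}

/// Hypothetical helper: peel off at most one archival suffix ("one level"),
/// then classify what remains of the file name.
pub fn classify(path: &Path) -> FileType {
    let name = match path.file_name().and_then(|n| n.to_str()) {
        Some(n) => n,
        None => return FileType::Unknown,
    };
    let (stem, archival_type) = if let Some(s) = name.strip_suffix(".gz") {
        (s, FileArchivalType::Gz)
    } else if let Some(s) = name.strip_suffix(".xz") {
        (s, FileArchivalType::Xz)
    } else if let Some(s) = name.strip_suffix(".tar") {
        (s, FileArchivalType::Tar)
    } else {
        (name, FileArchivalType::Normal)
    };
    // Assumed naming rules, for illustration only.
    if stem.ends_with(".evtx") {
        FileType::Evtx { archival_type }
    } else if stem.ends_with(".journal") {
        FileType::Journal { archival_type }
    } else if stem == "wtmp" || stem == "utmp" || stem == "btmp" {
        FileType::FixedStruct { archival_type, type_: FixedStructFileType::Utmp }
    } else {
        FileType::Text { archival_type }
    }
}
```

With this shape, wtmp.gz and log.gz share the same archival_type but land in different primary variants, so the parser choice no longer depends on the compression.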

Issue #283 adds file-searching "seeking modes" (sequential seeking or random seeking). That "seeking mode" determination should be based on these proposed enum values.
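
A possible way to derive the seeking mode from the proposed enum is sketched below. The names SeekMode and seek_mode are hypothetical and are not taken from Issue #283. The mapping itself is also an assumption, in particular the choice for Tar.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FileArchivalType {
    Normal,
    Gz,
    Xz,
    Tar,
}

/// Hypothetical name for the "seeking mode" described in Issue #283.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SeekMode {
    Random,
    Sequential,
}

/// Hypothetical mapping from archival type to seeking mode.
pub fn seek_mode(archival_type: FileArchivalType) -> SeekMode {
    match archival_type {
        // A plain file on disk can be read at any offset.
        FileArchivalType::Normal => SeekMode::Random,
        // Assumed: compressed streams are decoded front to back.
        FileArchivalType::Gz | FileArchivalType::Xz => SeekMode::Sequential,
        // Assumed for this sketch; the real choice for tar entries may differ.
        FileArchivalType::Tar => SeekMode::Sequential,
    }
}
```

Because the match is exhaustive, adding a new FileArchivalType variant would force a decision about its seeking mode at compile time.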

Other Considerations

FileType::Journal

A FileType::Journal that is compressed or archived (archival_type != Normal) would need to be handled by extracting to a named temporary file (Issue #284).
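
The condition above could be expressed as a small predicate. This is only a sketch: needs_temp_extraction is a hypothetical name, and the FileType here is reduced to the variants needed for the example.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FileArchivalType {
    Normal,
    Gz,
    Xz,
    Tar,
}

// Reduced copy of the proposed enum; other variants are omitted for brevity.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FileType {
    Text { archival_type: FileArchivalType },
    Journal { archival_type: FileArchivalType },
}

/// Hypothetical predicate: true when the file must first be extracted
/// to a named temporary file (the approach described in Issue #284).
pub fn needs_temp_extraction(file_type: &FileType) -> bool {
    match file_type {
        FileType::Journal { archival_type } => *archival_type != FileArchivalType::Normal,
        _ => false,
    }
}
```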

Multiple levels of archiving or compression

This proposed design presumes "one level" of archiving or compression: it can process file log.tar but cannot process file log.tar.xz. The Chained Block Readers implementation (Issue #14) would build on this design. It would require the FileType value to hold or reference a stack of FileArchivalType values, which would allow processing a file such as log.tar.xz.
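
As a rough sketch of what a stack of FileArchivalType could look like, the hypothetical function below peels archival suffixes off a file name, outermost layer first. The name archival_stack and the suffix-based detection are assumptions for illustration, not part of the proposal.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FileArchivalType {
    Normal,
    Gz,
    Xz,
    Tar,
}

/// Hypothetical helper: returns the remaining base name and the archival
/// layers found, ordered from the outermost layer to the innermost.
pub fn archival_stack(file_name: &str) -> (&str, Vec<FileArchivalType>) {
    let mut rest = file_name;
    let mut stack = Vec::new();
    loop {
        let (stem, layer) = if let Some(s) = rest.strip_suffix(".gz") {
            (s, FileArchivalType::Gz)
        } else if let Some(s) = rest.strip_suffix(".xz") {
            (s, FileArchivalType::Xz)
        } else if let Some(s) = rest.strip_suffix(".tar") {
            (s, FileArchivalType::Tar)
        } else {
            // No further archival suffix; stop peeling.
            break;
        };
        stack.push(layer);
        rest = stem;
    }
    (rest, stack)
}
```

For log.tar.xz this yields the layers Xz then Tar, which is the order a chain of block readers would need to unwrap them. An empty stack corresponds to FileArchivalType::Normal in the one-level design.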

Encoding

Work on this issue can also cover Issue #16 by adding a sub-variant text_encoding to FileType::Text.

pub enum TextEncodingType {
    Utf8Ascii,
    Utf16,
    Utf32,
}
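
One way a text_encoding value might be chosen is by checking for a byte-order mark at the start of the file. The sketch below assumes that approach; sniff_encoding is a hypothetical name and nothing in this issue specifies how the encoding would be detected.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TextEncodingType {
    Utf8Ascii,
    Utf16,
    Utf32,
}

/// Hypothetical helper: guess the encoding from a leading byte-order mark.
/// Falls back to `Utf8Ascii` when no UTF-16 or UTF-32 mark is present.
pub fn sniff_encoding(head: &[u8]) -> TextEncodingType {
    // Check UTF-32 first: its little-endian mark begins with the UTF-16 LE mark.
    // (FF FE 00 00 could also be UTF-16 LE followed by a NUL character;
    // this sketch does not try to resolve that ambiguity.)
    if head.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) || head.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        TextEncodingType::Utf32
    } else if head.starts_with(&[0xFF, 0xFE]) || head.starts_with(&[0xFE, 0xFF]) {
        TextEncodingType::Utf16
    } else {
        TextEncodingType::Utf8Ascii
    }
}
```

The result would then be stored alongside archival_type in FileType::Text.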
@jtmoon79 jtmoon79 added enhancement New feature or request P1 important difficult A difficult problem; a major coding effort or difficult algorithm to perfect labels Apr 20, 2024
@jtmoon79 jtmoon79 changed the title refactor FileType to make compressed and archived files handled transparent refactor enum FileType to make compressed and archived file handling transparent Apr 20, 2024
@jtmoon79 jtmoon79 changed the title refactor enum FileType to make compressed and archived file handling transparent refactor FileType, BlockReader so compressed and archived file handling is transparent Apr 20, 2024
@jtmoon79 jtmoon79 changed the title refactor FileType, BlockReader so compressed and archived file handling is transparent refactor FileType so compressed and archived file handling is transparent Apr 22, 2024
jtmoon79 added a commit that referenced this issue Apr 30, 2024
refactor `enum FileType` to embed archive and storage information in
field variant `archival_type`
Add variant `encoding_type` for `FileType::Text`

refactor `pathbuf_to_filetype` to be more straightforward and recursive

entirely remove `Mimeguess`
Issue #15 (completed)

This is part 1 of completing the following issues:
Issue #257
Issue #285
jtmoon79 added a commit that referenced this issue Apr 30, 2024
Refactor `path_to_filetype` to allow filetype_archive (gz, xz)
for parseable files EVTX, FixedStruct, journal.
Allow compressed `.tar` files.
Only allows a "single level" of archival type.

None of these are handled yet.

This is part 2 of:
Issue #257
Issue #285
jtmoon79 added a commit that referenced this issue May 4, 2024
Refactor process_path_tar to be more predictable, and to notify about
unsupported archive in archive files (log.xz with logs.tar).

Issue #7
Issue #14
Issue #16
Issue #285
jtmoon79 added a commit that referenced this issue May 29, 2024