fix: rewrite the EOCD/EOCD64 detection to fix extreme performance regression #247

Merged
merged 23 commits into from
Dec 16, 2024

@RisaI RisaI commented Sep 27, 2024

In this PR, a new way of finding the EOCD/EOCD64 blocks is introduced. The motivation is to fix the extreme performance regression introduced in cb2d7ab. This PR is also expected to close #231.

What was wrong

In the previous iteration, the ZipArchive::get_metadata function scanned the entire contents of the archive (in extreme cases multiple times) and contained some amount of backtracking even in the best case scenario of no prepended junk data. There was also a lot of code duplication in the functions for finding EOCD and EOCD64 blocks.

What this PR introduces

A MagicFinder pseudo-iterator is implemented for generic magic needle search from the end of a seekable reader. An additional OptimisticMagicFinder is implemented for the best case scenario, where the archive offset is either known exactly or is equal to zero, so that no scanning is performed, as it would be unnecessary.
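The backward search can be illustrated with a minimal sketch. This is not the crate's `MagicFinder`: the function name `find_last_magic`, its signature, and the 2048-byte buffer are assumptions made for illustration only. It reads fixed-size windows from the end of a seekable reader and reports the last occurrence of the needle that lies entirely before `end`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper (not the crate's API): returns the offset of the last
/// occurrence of `magic` that lies entirely within the first `end` bytes.
fn find_last_magic<R: Read + Seek>(
    reader: &mut R,
    magic: &[u8],
    end: u64,
) -> io::Result<Option<u64>> {
    const BUFFER_SIZE: usize = 2048; // assumed size, chosen for the sketch
    assert!(!magic.is_empty() && BUFFER_SIZE >= magic.len());

    let mut buf = [0u8; BUFFER_SIZE];
    // Exclusive upper bound of the bytes the next window may read.
    let mut window_end = end;

    while window_end >= magic.len() as u64 {
        let window_start = window_end.saturating_sub(BUFFER_SIZE as u64);
        let len = (window_end - window_start) as usize;

        reader.seek(SeekFrom::Start(window_start))?;
        reader.read_exact(&mut buf[..len])?;

        // Search the window from its end toward its start.
        if let Some(pos) = buf[..len].windows(magic.len()).rposition(|w| w == magic) {
            return Ok(Some(window_start + pos as u64));
        }
        if window_start == 0 {
            break;
        }
        // Keep magic.len() - 1 bytes of overlap so that a needle straddling
        // the window boundary is seen by the next pass, but never twice.
        window_end = window_start + magic.len() as u64 - 1;
    }
    Ok(None)
}
```

Calling it again with `end` set to one byte short of the end of a rejected match resumes the search before that match, which is roughly what a pseudo-iterator over candidates would do.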

The find_and_parse methods of Zip32CentralDirectoryEnd and Zip64CentralDirectoryEnd are replaced with a common find_central_directory function. The function employs the following strategy to locate the EOCD and, if one is expected to be present, the EOCD64:

  • A MagicFinder is used to find EOCD magic bytes
  • The EOCD block is parsed and its internal validity is checked, discarding the entry if it's invalid
  • It is determined whether the EOCD contents indicate the file is ZIP64
  • In the ZIP32 case:
    • An empty archive is assumed to always be correct and is returned
    • For a non-empty archive, the first CDFH is searched for between the relative offset in EOCD and the EOCD offset
    • If found, the archive offset is determined and the entry is returned, otherwise it is discarded
  • In the ZIP64 case:
    • The EOCD64 Locator is parsed at its expected position
    • Its internal validity is checked
    • EOCD64 magic bytes are searched for between the relative offset in the Locator and Locator's own offset
    • If EOCD64 is found and is internally valid, the archive offset is determined and the EOCD + EOCD64 are returned
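
The ZIP32 branch of the strategy above can be sketched roughly as follows. The record layout (22 fixed bytes, little-endian fields) follows the ZIP specification; the type `Eocd`, the helper names, the particular validity checks, and the in-memory `&[u8]` input are assumptions made for illustration and do not mirror the PR's code.

```rust
const EOCD_MAGIC: [u8; 4] = *b"PK\x05\x06";
const CDFH_MAGIC: [u8; 4] = *b"PK\x01\x02";
const EOCD_LEN: usize = 22; // fixed-size part of the EOCD record

/// Hypothetical subset of the EOCD fields needed by this sketch.
#[derive(Debug, PartialEq)]
struct Eocd {
    entries_on_disk: u16,
    total_entries: u16,
    cd_size: u32,
    cd_offset: u32,
    comment_len: u16,
}

fn le_u16(bytes: &[u8], at: usize) -> u16 {
    u16::from_le_bytes([bytes[at], bytes[at + 1]])
}

fn le_u32(bytes: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Parses an EOCD candidate at `pos`; returns None if it is not plausible.
fn parse_eocd(data: &[u8], pos: usize) -> Option<Eocd> {
    let record = data.get(pos..pos.checked_add(EOCD_LEN)?)?;
    if !record.starts_with(&EOCD_MAGIC) {
        return None;
    }
    let eocd = Eocd {
        entries_on_disk: le_u16(record, 8),
        total_entries: le_u16(record, 10),
        cd_size: le_u32(record, 12),
        cd_offset: le_u32(record, 16),
        comment_len: le_u16(record, 20),
    };
    // Assumed internal checks: the comment and the directory must fit.
    let comment_fits = pos + EOCD_LEN + eocd.comment_len as usize <= data.len();
    let directory_fits = eocd.cd_size as usize <= pos;
    let counts_agree = eocd.entries_on_disk <= eocd.total_entries;
    (comment_fits && directory_fits && counts_agree).then_some(eocd)
}

/// ZIP32 case: derives the archive offset (bytes of prepended data).
fn zip32_archive_offset(data: &[u8], eocd_pos: usize, eocd: &Eocd) -> Option<usize> {
    let declared = eocd.cd_offset as usize;
    if eocd.total_entries == 0 {
        // Assumption: an empty directory sits directly before the EOCD.
        return eocd_pos.checked_sub(declared);
    }
    // Look for the first CDFH between the declared offset and the EOCD.
    // Its index relative to the declared offset is the archive offset.
    let region = data.get(declared..eocd_pos)?;
    region
        .windows(CDFH_MAGIC.len())
        .position(|w| w == CDFH_MAGIC.as_slice())
}
```

A `None` from either function corresponds to discarding the candidate and moving on to the next EOCD match.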

The code for get_metadata was simplified, because additional information is now available after finding the EOCDs (the archive offset in particular).

Performance

In my internal testing (reading a 44MB ZIP with 30 files to obtain the file count), the extreme performance regression is gone, while still satisfying all of the tests.

2.1.3: ~6.2ms
2.2.0: ~14s
This PR: ~13.5ms

Clearly there are still some performance regressions left, but according to my testing, the code path I spent time optimizing takes only 0.3ms out of the 13.5ms, so the remaining regression must lie elsewhere.

What remains to be done

  • get feedback for the EOCD/CDFH naming convention in ZipErrors
  • deprecating the ArchiveOffset::FromCentralDirectory option?
  • reconciling EOCD and EOCD64 archive comments

Naming conventions for byte blocks

In this PR I use the original naming convention for the different types of blocks (End Of Central Directory instead of Central Directory End, etc.). If this is wrong, please let me know and I will revert it. If you'd like me to change the rest of the errors to match the official convention, I'd also be happy to do that.

ArchiveOffset::FromCentralDirectory deprecation

This option was previously respected for ZIP32 only, but the logic did not make much sense to me. A GitHub search reveals that there is no public repository using this option. Could you please clarify what the intent of this feature was? The way I see it, it's enough to have an option to do offset detection if the initial guess fails, plus the Known variant to opt out of the detection mechanism.

Archive comments duality

Previously, this PR introduced a breaking change in which the ZipArchive comment for ZIP64 was read from EOCD64 instead of EOCD. Instead, I have now introduced the zip64_comment field in Shared and made it available separately. Both comments can now be used, and it should be clear which is which.

Member

@Pr0methean Pr0methean left a comment

Looks like this is on the right track; please fix the failing tests.

src/read.rs Outdated
Ok((Rc::try_unwrap(footer).unwrap(), shared.build()))
pub(crate) fn get_metadata(config: Config, reader: &mut R) -> ZipResult<Shared> {
// Find the EOCD and possibly EOCD64 entries and determine the archive offset.
let cde = spec::find_central_directory(reader, config.archive_offset)?;
Member

What happens if something that looks like a valid EOCD or EOCD64 block, but doesn't have a valid central directory in front of it and thus fails try_from, is included in the file comment of a valid ZIP file? We should keep looking for the real one in that case.

Contributor Author

I'll wrap this in a loop and allow the find_central_directory function to continue from the previous EOCD candidate backwards.
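
A rough sketch of such a loop, assuming an in-memory buffer and a caller-supplied validation closure. The name `last_valid_candidate` and its signature are hypothetical and are not taken from the PR.

```rust
/// Hypothetical helper: visits occurrences of `magic` from the end of `data`
/// toward the start and returns the first one that `validate` accepts.
fn last_valid_candidate<T>(
    data: &[u8],
    magic: &[u8],
    mut validate: impl FnMut(usize) -> Option<T>,
) -> Option<T> {
    assert!(!magic.is_empty());
    // Exclusive upper bound of the region still to be searched.
    let mut end = data.len();
    while end >= magic.len() {
        // `?` ends the search once no candidate is left.
        let pos = data[..end].windows(magic.len()).rposition(|w| w == magic)?;
        if let Some(found) = validate(pos) {
            return Some(found);
        }
        // Rejected (for example an EOCD look-alike inside a comment):
        // continue strictly before this candidate.
        end = pos + magic.len() - 1;
    }
    None
}
```

With this shape, a decoy EOCD embedded in the file comment is tried first, fails validation, and the search continues backward to the real record.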


// Smaller buffer size would be unable to locate bytes.
// Equal buffer size would stall (the window could not be moved).
debug_assert!(BUFFER_SIZE > magic_bytes.len());
Member

Should actually be 2 * BUFFER_SIZE - 1, to ensure that if the entire magic couldn't fit into the window before shifting the window, it can afterward.

Contributor Author

On each window pass, the cursor is moved back by BUFFER_SIZE - magic_bytes.len(), so the window will contain magic bytes that lie at the boundary. Looking at it now, we actually must move the window by BUFFER_SIZE - magic_bytes.len() + 1 to avoid counting magic bytes exactly at the start of the window twice. The actual assertion for the window to move should then be BUFFER_SIZE >= magic_bytes.len().
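
The window arithmetic can be checked with a small sketch. The helpers `scan_windows` and `coverage` are hypothetical; they model only the offsets, not any I/O. Moving the window start back by `buffer_size - magic_len + 1` is equivalent to letting consecutive windows overlap by `magic_len - 1` bytes.

```rust
/// Hypothetical model of the scan: returns the (start, end) byte range of
/// each window, from the end of a `file_len`-byte file toward its start.
fn scan_windows(file_len: usize, buffer_size: usize, magic_len: usize) -> Vec<(usize, usize)> {
    // The window can only move if it is at least as large as the needle.
    assert!(magic_len > 0 && buffer_size >= magic_len);
    let mut out = Vec::new();
    let mut end = file_len;
    while end >= magic_len {
        let start = end.saturating_sub(buffer_size);
        out.push((start, end));
        if start == 0 {
            break;
        }
        // Overlap of magic_len - 1 bytes with the window just scanned.
        end = start + magic_len - 1;
    }
    out
}

/// Counts the windows that fully contain a needle starting at `pos`.
fn coverage(windows: &[(usize, usize)], pos: usize, magic_len: usize) -> usize {
    windows
        .iter()
        .filter(|&&(start, end)| start <= pos && pos + magic_len <= end)
        .count()
}
```

For a 20-byte file, an 8-byte buffer, and a 4-byte needle, the windows are (12, 20), (7, 15), (2, 10), and (0, 5), and every possible needle position is covered by exactly one of them. The same holds when the buffer is exactly as large as the needle, in which case the window moves one byte per pass.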

src/spec.rs Outdated
}

pub(crate) struct CentralDirectoryEndInfo {
pub eocd: (Zip32CentralDirectoryEnd, u64),
Member

Split this into two fields for readability.

Contributor Author

I'd rather introduce DataWithPosition<T> to keep it as a single field, because the eocd64 is an Option and having to match two Options when they both must be either Some or None at the same time would be tedious.
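
One possible shape for such a wrapper is sketched below. Only the name `DataWithPosition<T>` comes from the comment above; the field names, the `EndInfoSketch` container, and its method are assumptions made for illustration.

```rust
/// A parsed block together with the offset at which it was found.
/// Field names are assumed; only the type name comes from the discussion.
#[derive(Debug, Clone, PartialEq)]
struct DataWithPosition<T> {
    data: T,
    position: u64,
}

/// Hypothetical stand-in for the struct under review, generic over the
/// EOCD and EOCD64 block types so the sketch stays self-contained.
struct EndInfoSketch<E32, E64> {
    eocd: DataWithPosition<E32>,
    eocd64: Option<DataWithPosition<E64>>,
}

impl<E32, E64> EndInfoSketch<E32, E64> {
    /// Offset of the first end-of-central-directory block in the file.
    /// A single `Option` is matched; the block and its position cannot
    /// disagree about being present.
    fn first_block_position(&self) -> u64 {
        match &self.eocd64 {
            Some(eocd64) => eocd64.position,
            None => self.eocd.position,
        }
    }
}
```

Compared with two parallel `Option` fields, this makes the "both present or both absent" invariant hold by construction.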

src/spec.rs Outdated
pub fn write<T: Write>(self, writer: &mut T) -> ZipResult<()> {
let (block, comment) = self.block_and_comment()?;
block.write(writer)?;
writer.write_all(&comment)?;
Ok(())
}

pub fn is_zip64(&self) -> bool {
Member

Call this may_be_zip64 instead, because a ZIP32 file may happen to have u16::MAX files or u32::MAX bytes before the central directory.

src/spec.rs Outdated
continue;
}

// Branch out for zip32
Member

Handle the case of u16::MAX files in a ZIP32 I mentioned above. This may mean changing is_zip64() to return a yes/no/maybe enum.

Contributor Author

If we support these edge cases, then the function can only return maybe/no, because it cannot verify the zip is actually zip64 locally. Changing the name of the function to may_be_zip64 sounds like the better option.
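
A minimal sketch of the maybe/no check, assuming it relies on the ZIP64 convention that a field holding its maximum value (0xFFFF or 0xFFFFFFFF) defers to the ZIP64 record. The struct `EocdFields` and the exact set of fields inspected are assumptions, not the crate's definition.

```rust
/// Hypothetical subset of the ZIP32 EOCD fields.
struct EocdFields {
    number_of_files: u16,
    central_directory_size: u32,
    central_directory_offset: u32,
}

impl EocdFields {
    /// Returns true if any field holds the ZIP64 sentinel value.
    /// This can only answer "maybe": a genuine ZIP32 archive may happen to
    /// contain exactly u16::MAX files, so the caller still has to look for
    /// a valid EOCD64 Locator before treating the archive as ZIP64.
    fn may_be_zip64(&self) -> bool {
        self.number_of_files == u16::MAX
            || self.central_directory_size == u32::MAX
            || self.central_directory_offset == u32::MAX
    }
}
```

When the method returns true but no valid Locator and EOCD64 are found, falling back to the ZIP32 interpretation covers the edge case raised in the review.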

@RisaI
Contributor Author

RisaI commented Oct 20, 2024

The tests in CI seem to fail due to clippy lints enforced in parts of the codebase I did not even touch. The same seems to happen to other PRs in this repository. At first glance, this is caused by a few clippy defaults being changed in nightly. I will submit another PR to resolve those and then rebase this branch on that one.

@wolfv

wolfv commented Oct 31, 2024

This has helped a user quite a lot when extracting from a network file share (NFS); I believe seeks are very expensive on NFS. Here is the before and after: prefix-dev/rattler-build#1045 (comment)

@RisaI
Contributor Author

RisaI commented Nov 5, 2024

Alright, I finally got to finish the edge case ZIP32 detection. This caused the fuzzer to detect some cases where the library would try to allocate too much data. I handled this by adding an EOCD64 consistency check that invalidates the entry if the number of files would not fit in the central directory. If the tests pass, I think all of the features are now implemented.
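
The thread does not show the exact check, so the following is only a sketch of one way to express it. It assumes the test compares the declared file count against the central directory size, using the fact that every central directory file header occupies at least 46 bytes. The function name is hypothetical.

```rust
/// Fixed-size part of a central directory file header, in bytes.
const CDFH_MIN_LEN: u64 = 46;

/// Hypothetical consistency check: can `number_of_files` headers possibly
/// fit into a central directory of `central_directory_size` bytes?
fn file_count_fits(number_of_files: u64, central_directory_size: u64) -> bool {
    match number_of_files.checked_mul(CDFH_MIN_LEN) {
        Some(min_size) => min_size <= central_directory_size,
        // The multiplication overflowed, so the count is certainly bogus.
        None => false,
    }
}
```

Rejecting such entries before reserving space for `number_of_files` records avoids the oversized allocations that the fuzzer triggered.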

@RisaI RisaI requested a review from Pr0methean November 5, 2024 18:40
Pr0methean
Pr0methean previously approved these changes Nov 17, 2024
@Pr0methean Pr0methean added this pull request to the merge queue Nov 17, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 17, 2024
@RisaI
Contributor Author

RisaI commented Nov 18, 2024

@Pr0methean I am unable to replicate the clippy errors that are present in the merge queue CI run. Any idea what's wrong?

@Pr0methean
Member

Make sure you're running against the nightly toolchain.

@RisaI
Contributor Author

RisaI commented Nov 18, 2024

This is, again, unrelated to the code in this branch. I'll submit a new PR.

Signed-off-by: Chris Hennick <4961925+Pr0methean@users.noreply.github.com>
@Pr0methean Pr0methean enabled auto-merge November 19, 2024 16:08
@Pr0methean Pr0methean added this pull request to the merge queue Nov 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Nov 19, 2024
@RisaI
Contributor Author

RisaI commented Nov 21, 2024

@Pr0methean There was a race with #255 in the merge queue, because that PR tweaked code that is removed by this PR. I merged with master again, so now it should pass.

@RisaI RisaI requested a review from Pr0methean November 21, 2024 23:41
@nickbabcock
Contributor

I tried this PR on a 200GB zip file (233899 files within) that I access over a networked share.

I wanted to report back that this PR falls short of v2.1.3 performance which can start processing individual files immediately (and this PR still takes several orders of magnitude longer to start processing).

This isn't to block the good work here, but perhaps closing #231 is too aggressive

@Pr0methean Pr0methean added this pull request to the merge queue Dec 16, 2024
Merged via the queue into zip-rs:master with commit 33c71cc Dec 16, 2024
38 checks passed
@Pr0methean Pr0methean mentioned this pull request Dec 16, 2024
Successfully merging this pull request may close these issues.

Regression: opening large zip files is slow since 2.1.4 because the entire file is scanned
4 participants