parallel operations workstream #193

Open
4 tasks
cosmicexplorer opened this issue Jun 13, 2024 · 0 comments
Labels
enhancement New feature or request


cosmicexplorer commented Jun 13, 2024

Is your feature request related to a problem? Please describe.
Zip files always keep an index (the central directory) separate from each entry's possibly-compressed data. This makes it possible to perform high-level split/merge operations without de/recompressing file contents. On benchmarks, this performs better than serially iterating over each entry to extract, or serially iterating over each file to compress.
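
To illustrate what "an index located separately" means in practice, here is a minimal std-only Rust sketch that finds the end-of-central-directory record at the tail of an archive and reads where the index lives. The names `IndexLocation` and `locate_index` are hypothetical and not part of the zip crate. The sketch assumes a single-disk, non-zip64 archive and does not guard against the signature bytes appearing inside the trailing comment.

```rust
/// Signature that opens the end-of-central-directory (EOCD) record.
const EOCD_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
/// Length of the fixed part of the EOCD record (the trailing comment is extra).
const EOCD_FIXED_LEN: usize = 22;

/// Hypothetical summary of where the index sits (not a zip crate type).
#[derive(Debug, PartialEq)]
struct IndexLocation {
    entry_count: u16,
    directory_size: u32,
    directory_offset: u32,
}

/// Scan backwards from the end of the archive for the EOCD record and
/// decode the fields that point at the central directory.
/// Assumption: no zip64 extensions and no multi-disk archives.
fn locate_index(archive: &[u8]) -> Option<IndexLocation> {
    if archive.len() < EOCD_FIXED_LEN {
        return None;
    }
    let last_start = archive.len() - EOCD_FIXED_LEN;
    (0..=last_start)
        .rev()
        .find(|&i| archive[i..].starts_with(&EOCD_SIGNATURE))
        .map(|i| {
            let rec = &archive[i..i + EOCD_FIXED_LEN];
            IndexLocation {
                // Total number of entries recorded in the central directory.
                entry_count: u16::from_le_bytes([rec[10], rec[11]]),
                // Size in bytes of the central directory.
                directory_size: u32::from_le_bytes([rec[12], rec[13], rec[14], rec[15]]),
                // Offset of the central directory from the start of the archive.
                directory_offset: u32::from_le_bytes([rec[16], rec[17], rec[18], rec[19]]),
            }
        })
}
```

Because every entry's position is recorded in this one place, a tool can decide how to split or merge entries by reading only the index, without touching the compressed data.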

Describe the solution you'd like
It's possible to extract zip files in parallel (see #72) as well as merge them to create archives in parallel (see discussion in #73).
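
As a rough sketch of why merging can run in parallel: once the byte length of each "child" is known, every child's starting offset in the output is an exclusive prefix sum, so the copies touch disjoint regions and need no coordination. The std-only Rust below illustrates only that idea. `plan_offsets` and `merge_parallel` are hypothetical names, the children are plain in-memory buffers rather than real zip files, and this is not necessarily how #73 or medusa-zip approach it. A real merge would also have to rebase the offsets stored in the central directory and write a combined index.

```rust
use std::thread;

/// Exclusive prefix sum: where each child starts in the merged output,
/// plus the total output length.
fn plan_offsets(children: &[Vec<u8>]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(children.len());
    let mut total = 0usize;
    for child in children {
        offsets.push(total);
        total += child.len();
    }
    (offsets, total)
}

/// Copy every child into its own disjoint region of one output buffer,
/// using one scoped thread per child (a real implementation would more
/// likely use a bounded thread pool).
/// Returns the merged bytes and each child's starting offset.
fn merge_parallel(children: &[Vec<u8>]) -> (Vec<u8>, Vec<usize>) {
    let (offsets, total) = plan_offsets(children);
    let mut out = vec![0u8; total];
    thread::scope(|scope| {
        let mut rest: &mut [u8] = &mut out;
        for child in children {
            // Carve off exactly this child's region; the regions never overlap.
            let (region, tail) = std::mem::take(&mut rest).split_at_mut(child.len());
            rest = tail;
            scope.spawn(move || region.copy_from_slice(child));
        }
    });
    (out, offsets)
}
```

The returned offsets are what a real merge would add to each child's recorded entry offsets when building the parent archive's index.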

Describe alternatives you've considered
Parallel zip extraction as in #72 has likely been implemented elsewhere. To my knowledge, however, the parallel split/merge technique in #73 (researched for pex-tool/pex#2175 and prototyped in https://github.com/cosmicexplorer/medusa-zip) has not been discussed or implemented before in other zip tooling. Please let me know of any prior art for this!

Additional context
TODO:

  • refactor reader wrappers to use generic type params in #207 ("refactor readers to use type parameters and not concrete vtables"); this gets us Send bounds
  • parallel/pipelined extraction in #208
  • bulk copy (no de/recompression) with entry renaming as in pex-tool/pex#2175 ("consume packed wheel cache in zipapp creation")
    • as in that pex change, bulk copy with renaming enables reconstituting a "parent" zip file from an ordered sequence of "child" zips, which may be used to very quickly reconstruct large zip files from immutable cached components.
    • when renaming is not required, ZipWriter::merge_contents() already works with a single io::copy() call. bulk copy with rename avoids de/recompression of file data, but must edit each renamed local file header and therefore requires O(n) io::copy() calls, where n is the number of entries.
  • parallel split/merge for extremely fast creation as in https://github.com/cosmicexplorer/medusa-zip
    • this zip crate should probably not get into the weeds of crawling the filesystem, which keeps medusa-zip useful as a separate crate, and ensures we don't add too much extraneous code to this one.
    • however, the process of merging an ordered sequence of "child" zips with ZipWriter::merge_contents() can be parallelized, and this is something the zip crate should be able to do.
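
The bulk-copy-with-rename item above can be sketched as follows, to show why each renamed entry needs its own io::copy() call: the local file header is rewritten with the new name, and the extra field plus the still-compressed data are streamed through untouched. This is a minimal std-only illustration, not the zip crate's implementation. `copy_entry_renamed` is a hypothetical helper, and it assumes a non-zip64 entry whose sizes are stored in the header (no trailing data descriptor). The central directory entry stores the name as well and would need the same edit, which is omitted here.

```rust
use std::io::{self, Read, Write};

/// Signature that opens every local file header.
const LOCAL_HEADER_SIGNATURE: [u8; 4] = [0x50, 0x4b, 0x03, 0x04];
/// Length of the fixed part of a local file header.
const LOCAL_HEADER_FIXED_LEN: usize = 30;

fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

/// Hypothetical helper: copy one entry from `src` (positioned at its local
/// file header) to `dst` under `new_name`, without de/recompressing.
/// Returns the number of bytes written to `dst`.
fn copy_entry_renamed<R: Read, W: Write>(
    src: &mut R,
    dst: &mut W,
    new_name: &str,
) -> io::Result<u64> {
    let mut header = [0u8; LOCAL_HEADER_FIXED_LEN];
    src.read_exact(&mut header)?;
    if !header.starts_with(&LOCAL_HEADER_SIGNATURE) {
        return Err(invalid("not a local file header"));
    }
    // Bit 3 of the flags means the sizes live in a trailing data descriptor.
    let flags = u16::from_le_bytes([header[6], header[7]]);
    if flags & 0x0008 != 0 {
        return Err(invalid("data descriptors are not handled in this sketch"));
    }
    let compressed_size =
        u32::from_le_bytes([header[18], header[19], header[20], header[21]]) as u64;
    let old_name_len = u16::from_le_bytes([header[26], header[27]]) as u64;
    let extra_len = u16::from_le_bytes([header[28], header[29]]) as u64;

    // Patch only the name length; every other header field is kept as-is.
    let new_name_len =
        u16::try_from(new_name.len()).map_err(|_| invalid("new name is too long"))?;
    header[26..28].copy_from_slice(&new_name_len.to_le_bytes());
    dst.write_all(&header)?;
    dst.write_all(new_name.as_bytes())?;

    // Discard the old name, then stream the extra field and compressed data.
    io::copy(&mut src.by_ref().take(old_name_len), &mut io::sink())?;
    let body_len = extra_len + compressed_size;
    let copied = io::copy(&mut src.by_ref().take(body_len), dst)?;
    if copied != body_len {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "entry is truncated"));
    }
    Ok(LOCAL_HEADER_FIXED_LEN as u64 + u64::from(new_name_len) + copied)
}
```

Calling this once per entry is the O(n) cost mentioned above; when no names change, the whole run of entries can instead go through a single io::copy().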