modules: consider switching to zstd for modules archives #2744

Closed
myitcv opened this issue Dec 21, 2023 · 5 comments
Labels
FeatureRequest New feature or request modules Issues related to CUE modules and the experimental implementation NeedsInvestigation

Comments

@myitcv
Member

myitcv commented Dec 21, 2023

What version of CUE are you using (cue version)?

$ cue version
cue version v0.0.0-20231220122134-3d09a4aafb64

go version go1.21.4
      -buildmode exe
       -compiler gc
  DefaultGODEBUG panicnil=1
     CGO_ENABLED 1
          GOARCH arm64
            GOOS linux
             vcs git
    vcs.revision 3d09a4aafb64d459fdc60348dc2da4665759a4b2
        vcs.time 2023-12-20T12:21:34Z
    vcs.modified false

Per @mvdan, Go is adding support for zstd compression of their release archives: golang/go#62446 (comment)

In the longer term, this could be a step toward zstd-compressed modules,
but that would require changing many more moving parts and is not in scope
for this specific proposal.

Given we've still got plenty of room to make design changes for modules (we are only experimenting right now), maybe it would be a good idea to do zstd from day one and not have to support older/worse compression algorithms like deflate.

Points to consider per @mvdan:

  • our module cache will basically always extract module archives, not just for cue/load as it is now, but also for the LSP by design
  • @mvdan suspects that many module archives will be half as large in e.g. tar.zst compared to zip; not just due to the better compression, but also because we'd compress all files together rather than separately.
  • we could always consider a middle ground, like a zip archive with zstd compression, if we must retain addressing single files inside the zip for e.g. io/fs
  • Go's std doesn't have zstd yet (compress/zstd: add new package golang/go#62513) but it's only a matter of time. They already have an internal/zstd decompressor, and they'll very soon want a compressor for net/http. In the meantime, we can use https://pkg.go.dev/github.com/klauspost/compress/zstd
@myitcv myitcv added FeatureRequest New feature or request NeedsInvestigation modules Issues related to CUE modules and the experimental implementation labels Dec 21, 2023
@github-project-automation github-project-automation bot moved this to Backlog in Release v0.8 Feb 13, 2024
@mvdan mvdan moved this from Backlog to v0.8.0-rc.1 in Release v0.8 Feb 13, 2024
@mvdan mvdan moved this to Ready in Modules Roadmap Feb 13, 2024
@myitcv myitcv moved this from Planned to Backlog in Modules Roadmap Feb 14, 2024
@rogpeppe
Member

I'm not against this per se, but some nuance on a couple of points above:

our module cache will basically always extract module archives, not just for cue/load as it is now, but also for the LSP by design

This makes it sound a bit like module archives will always be fully extracted to disk for CUE evaluation, but I can certainly imagine significant situations where that might not be necessary: for example when evaluating CUE non-interactively, such as in a browser or in a one-shot server-side evaluation. The zip format arguably could provide some significant performance advantages there, as it could decompress only files involved in the required packages.

we could always consider a middle ground, like a zip archive with zstd compression, if we must retain addressing single files inside the zip

I'd be inclined to avoid that, for now at least, because of Russ's reservations about tooling support under Windows.

@mvdan
Member

mvdan commented Feb 20, 2024

I can certainly imagine significant situations where that might not be necessary: for example when evaluating CUE non-interactively, such as in a browser or in a one-shot server-side evaluation.

How would you display errors? You would need paths/filenames in some form for the sake of debugging. I guess you could somehow treat the zip archive as a directory, but it would break anything that expects absolute paths to actually be files on disk, e.g. CI failure log viewers or terminals/editors where filenames are linkified.

The zip format arguably could provide some significant performance advantages there, as it could decompress only files involved in the required packages.

I think this argument goes both ways. tar.zst would compress far better than zip, meaning a performance and storage improvement off the bat for serving and fetching module archives - this is a win for everyone. As far as decompressing/extracting, it depends on how often we think the client will need to decompress entire module archives. My opinion is that cmd/cue will do that far more often than not, e.g. running cue export for the sake of errors pointing to real files on disk that the user can open.

I'd be inclined to avoid that, for now at least, because of Russ's reservations about tooling support under Windows.

I agree that zip files with zstd compression are likely not the best option - they do marginally improve compression, but as a middle-ground solution, they make no one happy :)

@rogpeppe
Member

I can certainly imagine significant situations where that might not be necessary: for example when evaluating CUE non-interactively, such as in a browser or in a one-shot server-side evaluation.

How would you display errors? You would need paths/filenames in some form for the sake of debugging. I guess you could somehow treat the zip archive as a directory, but it would break anything that expects absolute paths to actually be files on disk, e.g. CI failure log viewers or terminals/editors where filenames are linkified.

This is a good question. In general even file names aren't sufficient, because they're relative to the local filesystem which varies from place to place. One possibility is to use some kind of URL notation (not impossible because it might well be possible to point directly to the registry from whence the source came), or use a custom notation that identifies the source module and version.

In general, I wouldn't want to make it infeasible to evaluate CUE in situations where there's no available filesystem, and conversely, I think that tying ourselves to file-based error messages is probably a bit too limiting (a temporary file name might mean nothing to a user where a more domain-focused name might be more informative).

The zip format arguably could provide some significant performance advantages there, as it could decompress only files involved in the required packages.

I think this argument goes both ways. tar.zst would compress far better than zip, meaning a performance and storage improvement off the bat for serving and fetching module archives - this is a win for everyone. As far as decompressing/extracting, it depends on how often we think the client will need to decompress entire module archives. My opinion is that cmd/cue will do that far more often than not, e.g. running cue export for the sake of errors pointing to real files on disk that the user can open.

Note that cue export does not need to decompress the entire archive: it could potentially just decompress the packages that are required. With large modules, that could potentially be a significant win.
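
As a minimal sketch of that idea, assuming a zip archive whose entries are laid out as `dir/file.cue`: the central directory lets a reader pick out the entries of one package directory and decompress only those. `buildZip` and `readPackage` are hypothetical helpers written for this illustration; they are not part of cue/load or modzip, and a real loader would also need to follow imports.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// buildZip creates an in-memory zip from name -> content (sample data only).
func buildZip(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, content := range files {
		w, err := zw.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(w, content); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readPackage decompresses only the files directly inside dir.
// Entries elsewhere in the archive are skipped without being opened.
func readPackage(zipData []byte, dir string) (map[string][]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(zipData), int64(len(zipData)))
	if err != nil {
		return nil, err
	}
	prefix := strings.TrimSuffix(dir, "/") + "/"
	out := make(map[string][]byte)
	for _, f := range zr.File {
		rest, ok := strings.CutPrefix(f.Name, prefix)
		if !ok || rest == "" || strings.Contains(rest, "/") {
			continue // not a direct child of dir; left compressed
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		out[f.Name] = data
	}
	return out, nil
}

func main() {
	data, err := buildZip(map[string]string{
		"a/a.cue":     "package a\n",
		"b/b.cue":     "package b\n",
		"b/sub/c.cue": "package c\n",
	})
	if err != nil {
		panic(err)
	}
	got, err := readPackage(data, "b")
	if err != nil {
		panic(err)
	}
	for name := range got {
		fmt.Println("decompressed:", name)
	}
}
```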

Note that I'm not against using tar.zst in principle, but we should understand the trade-offs before making the leap.

@mvdan
Member

mvdan commented Mar 1, 2024

In general, I wouldn't want to make it infeasible to evaluate CUE in situations where there's no available filesystem, and conversely, I think that tying ourselves to file-based error messages is probably a bit too limiting (a temporary file name might mean nothing to a user where a more domain-focused name might be more informative).

Fair enough. To be clear, our error messages are already filename-based today, so I'm talking in practical terms about what is already the status quo.

Note that cue export does not need to decompress the entire archive: it could potentially just decompress the packages that are required. With large modules, that could potentially be a significant win.

Fair enough again. I don't suspect that module archives will become large today, but it's hard to predict how large they might get in the future.

I think we're in general agreement that we're OK with keeping standard zips for our first artifact version application/vnd.cue.module.v1+json. Zips are compatible with io/fs, which we're aiming to move towards for APIs like cue/load, whereas compressed tar archives require extracting the entire archive (or most of it?) to locate a file or implement io/fs methods like ReadDir.
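
To make the io/fs contrast concrete, here is a small sketch using only the standard library. `*zip.Reader` implements `fs.FS`, so `fs.ReadDir` works on it directly, while a tar stream has no index and has to be walked entry by entry until the wanted name turns up. The helpers `buildZip`, `buildTar`, `listDirInZip`, and `findInTar` are hypothetical names for this illustration, and the tar here is left uncompressed for brevity; a compressed tar would additionally need everything before the target decompressed.

```go
package main

import (
	"archive/tar"
	"archive/zip"
	"bytes"
	"errors"
	"fmt"
	"io"
	"io/fs"
)

// buildZip writes the given name/content pairs into an in-memory zip.
func buildZip(files [][2]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, f := range files {
		w, err := zw.Create(f[0])
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(w, f[1]); err != nil {
			return nil, err
		}
	}
	err := zw.Close()
	return buf.Bytes(), err
}

// buildTar writes the same pairs into an in-memory tar stream.
func buildTar(files [][2]string) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, f := range files {
		hdr := &tar.Header{Typeflag: tar.TypeReg, Name: f[0], Mode: 0o644, Size: int64(len(f[1]))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := io.WriteString(tw, f[1]); err != nil {
			return nil, err
		}
	}
	err := tw.Close()
	return buf.Bytes(), err
}

// listDirInZip lists a directory through the io/fs interface.
func listDirInZip(zipData []byte, dir string) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(zipData), int64(len(zipData)))
	if err != nil {
		return nil, err
	}
	entries, err := fs.ReadDir(zr, dir) // *zip.Reader is an fs.FS
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

// findInTar scans the stream sequentially and reports how many
// headers had to be read before the file was found.
func findInTar(tarData []byte, name string) ([]byte, int, error) {
	tr := tar.NewReader(bytes.NewReader(tarData))
	scanned := 0
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return nil, scanned, fs.ErrNotExist
		}
		if err != nil {
			return nil, scanned, err
		}
		scanned++
		if hdr.Name == name {
			data, err := io.ReadAll(tr)
			return data, scanned, err
		}
	}
}

func main() {
	files := [][2]string{
		{"cue.mod/module.cue", "module: \"example.com/m\"\n"},
		{"pkg/a.cue", "package pkg\n"},
		{"pkg/b.cue", "package pkg\n"},
	}
	zd, err := buildZip(files)
	if err != nil {
		panic(err)
	}
	names, err := listDirInZip(zd, "pkg")
	fmt.Println(names, err)

	td, err := buildTar(files)
	if err != nil {
		panic(err)
	}
	_, scanned, err := findInTar(td, "pkg/b.cue")
	fmt.Println("tar headers scanned:", scanned, err)
}
```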

For some rough realistic numbers, I ran a quick test of zip vs tar.zst on our latest alpha source archive:

  • cue-v0.8.0-alpha.4 uncompressed weighs about 15MiB
  • cue-v0.8.0-alpha.4.zip sits at 3.4MiB
  • cue-v0.8.0-alpha.4.tar.zst sits at 2.3MiB with the default compression level (fast, 3), and 1.7MiB with a high level (19)
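
For reference, the size ratios implied by the figures above can be worked out with a few lines of Go. The sizes are the ones quoted in the list; `ratio` is a hypothetical helper, and the rounding is only as good as the one-decimal inputs.

```go
package main

import "fmt"

// ratio reports how many times larger a is than b.
func ratio(a, b float64) float64 {
	return a / b
}

func main() {
	// Sizes in MiB, as quoted above.
	const (
		uncompressed = 15.0
		zipSize      = 3.4
		zstDefault   = 2.3 // tar.zst, level 3
		zstHigh      = 1.7 // tar.zst, level 19
	)
	fmt.Printf("uncompressed / zip:     %.2f\n", ratio(uncompressed, zipSize))
	fmt.Printf("zip / tar.zst (lvl 3):  %.2f\n", ratio(zipSize, zstDefault))
	fmt.Printf("zip / tar.zst (lvl 19): %.2f\n", ratio(zipSize, zstHigh))
}
```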

So it seems like a standard zip can take up to twice as much space as a well-compressed tar.zst. Network bandwidth and disk space are relatively cheap these days, so I don't think halving the archive size warrants losing io/fs support.

I'm also warming up to the idea that zip with zstd compression rather than deflate may be the future, e.g. for an application/vnd.cue.module.v2+json in a few years. I found https://nickb.dev/blog/there-and-back-again-with-zstd-zips/ illuminating in this respect; it seems like you can still get decent size reductions by swapping the per-file compression algorithm, to the point that the difference in size between zip+zstd and tar.zst might be rather small in most cases. zstd compression is already part of the ZIP spec, so I suspect it will become rather common in a matter of a few years.

For all the reasons above, I'm happy to ship v0.8.0 as currently implemented, with standard deflate zips. We can consider a v2 with zip+zstd in a few years, for the sake of decent network and disk usage wins, without losing io/fs compatibility at all.

One point we raised with @rogpeppe and @myitcv was to redesign the current modzip package so that it doesn't hard-code assumptions about zip archives, but instead it could be generic to any archive format that is io/fs compatible and can compress file by file. I actually struggled to find another reasonably well established such format, as TAR is definitely not one. Per the zip+zstd blog post above, https://github.com/electron/asar does exist, but isn't nearly as well established, and doesn't really bring significant benefit. I'm not even sure that it's often used with per-file compression.

So I'm actually fine with leaving the modzip API as it is currently. I don't think we will move away from zip archives in the next decade, at least not while io/fs compatibility is the top priority, which it is.

@mvdan
Member

mvdan commented Mar 4, 2024

We agree with @rogpeppe to close this as "won't fix" following the reasoning above, and @myitcv is happy with the outcome as well; closing for now. We can create a new issue or proposal in the future for zip+zstd or any other "v2" module archive format.

@mvdan mvdan closed this as completed Mar 4, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Done in Modules Roadmap Mar 4, 2024