Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrading binary delta support #1902

Closed
zorgiepoo opened this issue Jul 23, 2021 · 18 comments
Closed

Upgrading binary delta support #1902

zorgiepoo opened this issue Jul 23, 2021 · 18 comments
Milestone

Comments

@zorgiepoo
Copy link
Member

zorgiepoo commented Jul 23, 2021

From #1899, it is noted xar is deprecated, and our current delta implementation is lacking.

One suggestion is:

I think we could drop xar and bsdiff and replace them with zstd and some basic container format.
zstd, besides having decent compression and great speed, has an option to specify a "prefix" for compression/decompression. This prefix can be set to be the old file version, and then it automatically becomes an optimal file diff. (Prefix is a one-time dictionary. zstd also has an option to generate reusable dictionaries, but AFAIK this isn't suitable for our case, since the dictionaries are slow to generate and are supposed to be reusable for many compressions/decompressions)

zstd sounds promising. https://github.com/facebook/zstd/wiki/Zstandard-as-a-patching-engine has more info about using it for patching files like Sparkle currently does. I personally don't enjoy the slow bsdiff generation times so I'm happy to try something else..

We need to also (perhaps more importantly) switch to a different container format than xar, ideally one that is small(?) in surface and not a hassle to embed / maintain (we never had to embed xar). What are our options here?

Maybe we could upgrade and introduce new major versions to the format in piecemeal, starting first with changing the container format.

We also need to upgrade generate_appcast so it picks the correct version to generate patches from based on the Sparkle bundle in the old app (if it doesn't do that already).

We can drop support for the version 1 format that nobody should be using anymore (like from Sparkle 1.9 or something).

Tentatively, I'm going to classify this as a long term goal with some major undertaking, to be in "2.1" milestone, but would like to be proven otherwise :)

@kornelski
Copy link
Member

One more thing that occurred to me is that we could support deltas that work for multiple versions at the same time. The diffs could use files or file ranges that are common in multiple versions. This way there would be fewer delta files to upload and serve.

@zorgiepoo
Copy link
Member Author

In order to keep download files as small as possible doesn't that imply we'd have to download a byte range of the file and servers would have to support that?

Anyway, I think we should first decide on a container format to move away from xar, and get a development branch that uses a new container and have the unit tests pass.

Is choosing the container format going to be intertwined with the delta compression we choose? I guess we will want to compress the entire container. When we generate binary diffs for specific set of individual files, I'm not sure if those will be compressed or not.

@kornelski
Copy link
Member

I didn't even think about using HTTP range for this. What I had in mind is creating one patch file that is universal enough to apply to different source versions (i.e. as a base use files that are same in all applicable versions, don't reuse files (or fragments of files) that differ between versions).

@zorgiepoo
Copy link
Member Author

zorgiepoo commented Jul 24, 2021

Maybe if size is not sacrificed too much as point of delta binary support is providing smallest downloadable size; this will also likely require significant changes to our current tool. I think container format and compression format is more important now.

@zorgiepoo
Copy link
Member Author

I think probably libarchive and tar is a decent contender; they have a zstd filter. Haven't assessed the size of the dependency though or if the OS version is worthwhile/risk-safe to use. Our bsdiff currently indeed does not do compression as I (expectably) thought, since we compress the entire container.

@kornelski
Copy link
Member

What I had in mind is using zstd prefix feature instead of bsdiff, so getting rid of bsdiff entirely. But that is not compatible with any general archive format, especially not tar.

@kornelski
Copy link
Member

Also tar as a format is a bit of a mess, with tons of legacy limitations and incompatible workarounds for them.

@kornelski
Copy link
Member

I've prototyped zstd as a diffing engine. It's a mixed bag. If files differ significantly, them zstd+prefix is ~10% better than bsdiff+zstd. When files are quite similar, then bsdiff works better, even as bsdiff+gzip.

I've also prototyped universal diffs. If you allow 20% overhead, then one diff file can work for 3-5 versions when differences are small, and even 10 versions when differences are big. However, I'm not sure if that feature is desirable enough on its own to justify format switch.

For ImageOptim.app I'm getting delta files even as large as 50% of the full archive. I've assumed that it was because either bsdiff or xar wasn't compressing them well enough. However, now I see the data is actually quite tricky to compress, and I don't think any more that there's low-hanging fruit here.

@zorgiepoo
Copy link
Member Author

zorgiepoo commented Jul 28, 2021

What I had in mind is using zstd prefix feature instead of bsdiff, so getting rid of bsdiff entirely. But that is not compatible with any general archive format, especially not tar.

Shorter term I think just replacing the container format and keeping bsdiff is a reasonable first goal (getting off of xar is important). Then replacing bsdiff or adding prefix-diff or something else could be a next step; as your interesting results indicates, this could be tricky. I'm not sure if this works if we end up wanting to use zstd prefix and it's not compatible with a general archive format like you say..

Regardless of that, I was also thinking about compression over everything which I'm not sure if you factored in.

If you have a delta of 5 files, and 3 are binary diffed, but 2 are new:

  • Do you binary diff the 3 files without compression, and compress all 5 files together (i.e, what we do currently since our bsdiff does not do compression)
  • Do you binary diff the 3 files with compression (if we switch to something else that requires that), and compress all 5 files together (indicating a double-compression)
  • Or do you binary diff and compress the 3 files, and compress the other 2 new files, and then just stick them in a container..

@zorgiepoo
Copy link
Member Author

zorgiepoo commented Jul 28, 2021

Also tar as a format is a bit of a mess, with tons of legacy limitations and incompatible workarounds for them.

We just need some format that preserves file permissions, type (regular, dir, symlink), path name, contents, and way to add additional properties (like is this a removal, insertion, or diff).. I don't think we want xattr's or acl's. Developing our own, which is an option, may lead to some risk although maybe we could make something more efficient..

@kornelski
Copy link
Member

kornelski commented Jul 28, 2021

I think rolling our own format would be fine. It can be something really simple like a series of <path>\0<permissions><type><length><bsdiff-data> and the whole thing compressed together as a "solid" archive.

@ZhuoxingGuo
Copy link

What I had in mind is using zstd prefix feature instead of bsdiff, so getting rid of bsdiff entirely. But that is not compatible with any general archive format, especially not tar.

Shorter term I think just replacing the container format and keeping bsdiff is a reasonable first goal (getting off of xar is important). Then replacing bsdiff or adding prefix-diff or something else could be a next step; as your interesting results indicates, this could be tricky. I'm not sure if this works if we end up wanting to use zstd prefix and it's not compatible with a general archive format like you say..

Regardless of that, I was also thinking about compression over everything which I'm not sure if you factored in.

If you have a delta of 5 files, and 3 are binary diffed, but 2 are new:

  • Do you binary diff the 3 files without compression, and compress all 5 files together (i.e, what we do currently since our bsdiff does not do compression)
  • Do you binary diff the 3 files with compression (if we switch to something else that requires that), and compress all 5 files together (indicating a double-compression)
  • Or do you binary diff and compress the 3 files, and compress the other 2 new files, and then just stick them in a container..

Agree! Firstly It need to change the container format. It still doesn't work in macOS beta4

@zorgiepoo
Copy link
Member Author

zorgiepoo commented Aug 10, 2021

That was fixed in #1906; you'll need to use a nightly version of BinaryDelta (or compile the latest code from source yourself) and generate new patches to work on that Monterey beta. If it still doesn't work, file a new bug with a repro case (the old and new app).

@zorgiepoo
Copy link
Member Author

zorgiepoo commented Sep 6, 2021

Two more things:

  • SHA256 hashes
  • File content clones / moves / renames (candidate)

@kornelski
Copy link
Member

Don't worry about upgrading SHA here. We have a proper hash in EdDSA signing. Here it's just a consistency check, and for that case SHA-1 isn't broken.

Have a look at Docker — some releases used binary delta (I don't see any currently, not sure why). They have lots of small files. The metadata of these files alone may add up to a significant overhead.

@zorgiepoo
Copy link
Member Author

I wrote a new implementation for the container/archive in #2051 and added a file clone tracking (this subsumes file renames).

@zorgiepoo
Copy link
Member Author

Implementation was merged, now just some polish work. Short summary in 23fc577

This new format introduces:

* A new container format which stores metadata in a way that is more efficient for compression, decompression, and size. Creation time can be 2x faster, apply time can be a few seconds faster, size savings can be 500 KB - couple of MB due to metadata alone.
* An array of supported compression formats including lzma, bzip2, zlib and more. We now default to lzma which is as competitive as bzip2 (which we were using in version 2 format) in applying/creation times, but can save several MB on size.
* Tracking of files from an old app being replicated in different locations in the new app. This can track unchanged files being renamed and can result in significant savings if the files are large.

Version 2 format is still the default. To use version 3, pass --version=3 to BinaryDelta when creating a patch. We will switch the default to version 3 later. generate_appcast support is upcoming.

@zorgiepoo
Copy link
Member Author

Support for new format to try out including tools support (generate_appcast, BinaryDelta) has landed in 2.1.0-beta.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants