feat(handler): add geom_uzip handler #1143

Open: rxpha3l wants to merge 1 commit into main from geom_uzip

@rxpha3l rxpha3l commented Mar 3, 2025

geom_uzip is a FreeBSD feature for creating compressed disk images (usually containing UFS). The compression is done in blocks, and the resulting .uzip file can be mounted via the GEOM framework on FreeBSD.

The mkuzip header includes a table with block counts and sizes. The header declares the block size (size of decompressed blocks) and total number of blocks. Block size must be a multiple of 512 and defaults to 16384 in mkuzip.
It has the following structure:

  • Magic, a shebang stored in 10 bytes.
  • Version, which can vary, stored in 13 bytes.
  • Command, which can vary, stored in 105 bytes.
  • Block size, stored in 4 bytes.
  • Block count, stored in 4 bytes.
  • Table of contents (TOC), whose size depends on the file length.

The TOC is a list of uint64_t offsets into the file, one per block. To determine the length of a given block, read the next TOC entry and subtract the current offset from the next offset (this is why there is an extra TOC entry at the end). Each block is compressed using zlib. A standard zlib decompressor will decode each one to a block of size block_size.
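
The offset arithmetic described above can be sketched as follows. This is only an illustration, not the handler's code: `parse_uzip_toc` and `HEADER_LEN` are hypothetical names, the 128-byte text header is the sum of the three field sizes listed above, and big-endian integer fields are an assumption that should be checked against g_uzip.c.

```python
import struct

# Assumption: 10 + 13 + 105 bytes of magic, version and command text.
HEADER_LEN = 128


def parse_uzip_toc(data: bytes):
    """Return (block_size, [(offset, length), ...]) for an image held in memory."""
    # Assumption: block size and block count are big-endian uint32 values.
    block_size, block_count = struct.unpack_from(">II", data, HEADER_LEN)
    toc_start = HEADER_LEN + 8
    # The TOC holds block_count + 1 offsets; the extra entry marks the end
    # of the last block.
    offsets = struct.unpack_from(f">{block_count + 1}Q", data, toc_start)
    # Length of block i is offsets[i + 1] - offsets[i].
    blocks = [(start, end - start) for start, end in zip(offsets, offsets[1:])]
    return block_size, blocks
```

A zero length in the returned list corresponds to an empty chunk, and the last offset gives the end of the uzip file.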

Unblob parses the TOC to determine the start and end offsets of the uzip file. It then locates the compressed blocks, decompresses them with zlib and concatenates them to recover the decompressed file. Empty chunks are ignored, which is why the file decompressed with unblob can be slightly smaller than the original one.

[Sources]
https://github.com/freebsd/freebsd-src/blob/master/sys/geom/uzip/g_uzip.c

@qkaiser qkaiser self-assigned this Mar 3, 2025
@qkaiser qkaiser linked an issue Mar 3, 2025 that may be closed by this pull request
@qkaiser qkaiser added this to the Internship 2025 milestone Mar 3, 2025
@qkaiser qkaiser added enhancement New feature or request format:compression labels Mar 3, 2025
@rxpha3l rxpha3l force-pushed the geom_uzip branch 2 times, most recently from ea64c85 to b095981 Compare March 5, 2025 14:45

qkaiser commented Mar 6, 2025

@vlaci what would be the easiest way to add pyzstd to unblob's dependencies in Nix here? It's not yet upstream at https://github.com/NixOS/nixpkgs/blob/0fa90d642277de2c67e93204cc5870aba8af5878/pkgs/by-name/un/unblob/package.nix#L59 so we need a way to define it in this branch in the meantime.

@rxpha3l I'm using this fix locally, but not sure if it's idiomatic Nix

diff --git a/overlay.nix b/overlay.nix
index 9c5051e..265cd79 100644
--- a/overlay.nix
+++ b/overlay.nix
@@ -29,6 +29,8 @@ final: prev:
         ];
       };
 
+      dependencies = (super.dependencies or []) ++ [ prev.python3.pkgs.pyzstd ];
+
       # remove this when packaging changes are upstreamed
       cargoDeps = final.rustPlatform.importCargoLock {
         lockFile = ./Cargo.lock;

@rxpha3l rxpha3l force-pushed the geom_uzip branch 4 times, most recently from 519c7e2 to 34c6696 Compare March 7, 2025 14:24
decompressor = decompressor_cls()
for chunk in iterate_file(infile, current_offset, compressed_len):
    outfile.write(decompressor.decompress(chunk))
return ExtractResult(reports=fs.problems)

do we need to flush() the decompressor? probably not, but safer to do so

The decompressor could hold on to some decompressed bytes if given a buffer limit, but also in some other cases:

  • lzma.LZMADecompressor: .eof, .needs_input, .unused_data, but only used if we give a non-negative buffer size
  • pyzstd.ZstdDecompressor has the same fields, and further says in the .decompress() docstring:

Decompress data, return a chunk of decompressed data if possible, or b'' otherwise.

It stops after a frame is decompressed.

  • zlib.decompressobj() also has .eof and .unused_data, plus .flush() and .unconsumed_tail

So at least pyzstd says it will never return more than a frame's worth of decompressed data at once.

Maybe the others also have some surprising logic, and we need to do some more work to decompress a stream properly. Looking at the source of the lzma.decompress function, it also silently ignores input that follows the end of the stream. We should probably check for that as well, since extra bytes are potentially a problem.
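
A minimal sketch of what flushing plus end-of-stream checks could look like for the zlib case, using only the standard library. `decompress_block` is a hypothetical helper written for illustration, not code from this PR, and raising on trailing bytes is one possible policy rather than a decision made here.

```python
import zlib


def decompress_block(compressed: bytes) -> bytes:
    """Decompress one zlib stream and reject truncated or over-long input."""
    decompressor = zlib.decompressobj()
    out = decompressor.decompress(compressed)
    # flush() returns any output still buffered inside the decompressor.
    out += decompressor.flush()
    if not decompressor.eof:
        raise ValueError("truncated zlib stream")
    if decompressor.unused_data:
        raise ValueError("trailing bytes after the end of the zlib stream")
    return out
```

The lzma and pyzstd decompressor objects expose .eof and .unused_data too, so the same two checks should carry over, although they have no flush() method.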

Geom_uzip is a FreeBSD feature for creating compressed disk images
(usually containing UFS). The compression is done in blocks, and
the resulting .uzip file can be mounted via the GEOM framework on
FreeBSD.

The mkuzip header includes a table with block counts and sizes.
The header declares the block size (size of decompressed blocks)
and total number of blocks. Block size must be a multiple of 512
and defaults to 16384 in mkuzip.
It has the following structure:
> Magic, which is a shebang & compression identifier stored on 16 bytes.
> Format, which is a shell command that provides some general information.
> Block size, stored on 4 bytes.
> Block count, stored on 4 bytes.
> Table of contents (TOC), whose size depends on the file length.
The TOC is a list of uint64_t offsets into the file for each block.
To determine the length of a given block, read the next TOC entry
and subtract the current offset from the next offset (this is why
there is an extra TOC entry at the end). Each block is compressed
using zlib, lzma or zstd, as identified by the magic. The matching
standard decompressor will decode it to a block of size block_size.

Unblob parses the TOC to determine the start & end offsets of the
compressed file. It detects the compression method (zlib, lzma or
zstd). Finally, the chunks are decompressed to recover the initial
file. Empty chunks are ignored, which is why the file decompressed
with unblob can be slightly smaller than the original one.
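
As a rough illustration of why skipping empty chunks shrinks the output, the reassembly step could look like the sketch below. It assumes zlib-compressed blocks and an in-memory image; `extract_blocks` is a hypothetical name and the `(offset, length)` pairs are assumed to come from the TOC.

```python
import io
import zlib


def extract_blocks(data: bytes, blocks, outfile) -> int:
    """Write each decompressed block to outfile; return how many were skipped."""
    skipped = 0
    for offset, length in blocks:
        if length == 0:
            # Empty chunk: nothing is written, so the output ends up
            # shorter than the original image by one block.
            skipped += 1
            continue
        outfile.write(zlib.decompress(data[offset:offset + length]))
    return skipped
```

Writing block_size zero bytes for each skipped chunk instead would preserve the original size, at the cost of a larger extracted file.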

[Sources]
https://github.com/mikeryan/unuzip
https://www.baeldung.com/linux/filesystem-in-a-file
https://docs.python.org/3/library/zlib.html
https://github.com/freebsd/freebsd-src/blob/master/sys/geom/uzip/g_uzip.c
https://parchive.sourceforge.net/docs/specifications/parity-volume-spec/article-spec.html
https://www.mail-archive.com/dev-commits-src-main@freebsd.org/msg34955.html
Development

Successfully merging this pull request may close these issues.

Add Support for geom_uzip Compression (FreeBSD mkuzip)
5 participants