Rewritten: BZip3 Format Explanation (#143)
* Rewritten: BZip3 Format Explanation
* Added: Link to ImHex Pattern
* Added: Little Endian Suffix to Pseudostructs
* Bugfix: I accidentally swapped 'file' and 'frame' format around.
* Fixed: Used wrong field when defining used block type in size comparison
* update filter names
* improve phrasing in overview.md

---------

Co-authored-by: Kamila Szewczyk <27734421+kspalaiologos@users.noreply.github.com>
1 parent cc01039, commit 972e669
Showing 5 changed files with 210 additions and 39 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,180 @@ | ||
# BZip3 Format Specification | ||
|
||
Version 1 | ||
|
||
## Headers | ||
|
||
The File and Frame formats share the same structure; the only difference is that the Frame | ||
header adds a block count field. | ||
|
||
### File Header | ||
|
||
``` | ||
+----------------+------------------+--------------------+ | ||
| Header | Chunk 1 | Chunk 2 | | ||
| (9 bytes) | (variable size) | (variable size) | | ||
+----------------+------------------+--------------------+ | ||
``` | ||
|
||
This layout is produced by the command-line tool. | ||
|
||
### Frame Header | ||
|
||
``` | ||
+----------------+------------------+--------------------+ | ||
| Header | Chunk 1 | Chunk 2 | | ||
| (13 bytes) | (variable size) | (variable size) | | ||
+----------------+------------------+--------------------+ | ||
``` | ||
|
||
This layout is written by `bz3_compress` and read by `bz3_decompress`. | ||
|
||
### Header Structure | ||
|
||
| Field | Type | Description | File Header | Frame Header | | ||
| -------------- | ------ | ------------------------------- | ----------- | ------------ | | ||
| Signature | u8[5] | Fixed "BZ3v1" ASCII string | ✓ | ✓ | | ||
| Max Block Size | u32_le | Maximum decompressed block size | ✓ | ✓ | | ||
| Block Count | u32_le | Number of blocks in the stream | ✗ | ✓ | | ||
|
||
### Validation Rules | ||
|
||
1. **Signature**: Must exactly match "BZ3v1" | ||
2. **Max Block Size**: | ||
- Minimum: 65 KiB (66,560 bytes) | ||
- Maximum: 511 MiB (535,822,336 bytes) | ||
3. **Block Count** (Frame Format only): | ||
- Must match the actual number of blocks in the stream | ||
- Should be greater than 0 | ||
|
||
### Example Parser | ||
|
||
```c | ||
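#include <stdbool.h> | ||
#include <stdint.h> | ||
#include <stdio.h> | ||
#include <string.h> | ||
|
||
typedef struct { | ||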
uint32_t max_block_size; | ||
uint32_t block_count; // Frame Format only | ||
} bzip3_header_t; | ||
|
||
bool read_bzip3_header(FILE* fp, bzip3_header_t* header, bool is_frame_format) { | ||
char signature[6] = {0}; | ||
|
||
// Read signature | ||
if (fread(signature, 1, 5, fp) != 5) | ||
return false; | ||
|
||
if (strcmp(signature, "BZ3v1") != 0) | ||
return false; | ||
|
||
// Read max block size | ||
uint8_t size_bytes[4]; | ||
if (fread(size_bytes, 1, 4, fp) != 4) | ||
return false; | ||
|
||
header->max_block_size = read_neutral_s32(size_bytes); | ||
|
||
// Minimum is 65 KiB (66,560 bytes), maximum is 511 MiB (535,822,336 bytes) | ||
if (header->max_block_size < 66560 || | ||
header->max_block_size > 535822336) | ||
return false; | ||
|
||
// Read block count if Frame Format | ||
if (is_frame_format) { | ||
uint8_t count_bytes[4]; | ||
if (fread(count_bytes, 1, 4, fp) != 4) | ||
return false; | ||
|
||
header->block_count = read_neutral_s32(count_bytes); | ||
|
||
if (header->block_count == 0) | ||
return false; | ||
} | ||
|
||
return true; | ||
} | ||
``` | ||
All integers in BZip3 are stored unaligned, in little-endian byte order. | ||
A portable implementation is shown below. | ||
```c | ||
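#include <stdint.h> | ||
// Fixed-width aliases used by the helpers below | ||
typedef uint8_t u8; | ||
typedef int32_t s32; | ||
typedef uint32_t u32; | ||
// Reading a 32-bit little-endian integer | ||
static s32 read_neutral_s32(u8 * data) { | ||
return (s32)(((u32)data[0]) | | ||
(((u32)data[1]) << 8) | | ||
(((u32)data[2]) << 16) | | ||
(((u32)data[3]) << 24)); | ||
} | ||
// Writing a 32-bit little-endian integer | ||
static void write_neutral_s32(u8 * data, s32 value) { | ||
u32 bits = (u32)value; // shift as unsigned to avoid implementation-defined behavior | ||
data[0] = bits & 0xFF; | ||
data[1] = (bits >> 8) & 0xFF; | ||
data[2] = (bits >> 16) & 0xFF; | ||
data[3] = (bits >> 24) & 0xFF; | ||
} | ||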
``` | ||
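As a complement to the parser above, the following is a minimal sketch of the reverse direction: serializing a File or Frame header into a byte buffer with the same little-endian layout and the same validation limits. The names `put_u32_le` and `write_bzip3_header` are hypothetical and are not part of libbz3; the sketch only assumes the field order given in the Header Structure table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value in little-endian order. */
static void put_u32_le(uint8_t *out, uint32_t value) {
    out[0] = (uint8_t)(value & 0xFF);
    out[1] = (uint8_t)((value >> 8) & 0xFF);
    out[2] = (uint8_t)((value >> 16) & 0xFF);
    out[3] = (uint8_t)((value >> 24) & 0xFF);
}

/*
 * Hypothetical sketch: write a File header (9 bytes) or a Frame header
 * (13 bytes) into `out`, which must have room for 13 bytes.
 * Returns the number of bytes written, or 0 if a value breaks the
 * validation rules.
 */
static size_t write_bzip3_header(uint8_t *out, uint32_t max_block_size,
                                 uint32_t block_count, bool is_frame_format) {
    const uint32_t min_size = 65u * 1024u;          /* 65 KiB  */
    const uint32_t max_size = 511u * 1024u * 1024u; /* 511 MiB */

    if (max_block_size < min_size || max_block_size > max_size)
        return 0;
    if (is_frame_format && block_count == 0)
        return 0;

    memcpy(out, "BZ3v1", 5);             /* signature, no terminator */
    put_u32_le(out + 5, max_block_size); /* max block size */
    if (!is_frame_format)
        return 9;

    put_u32_le(out + 9, block_count);    /* block count, Frame only */
    return 13;
}
```

For a 1 MiB block size, bytes 5 to 8 of the output are `00 00 10 00`, which is `0x00100000` in little-endian order.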
|
||
## Block Format | ||
|
||
After the header, both File and Frame formats contain a sequence of blocks that follow the Block | ||
Format specification. Each block is encapsulated in a chunk structure that defines its size. | ||
|
||
The blocks (***without chunk header***) can be encoded/decoded using the `bz3_encode_block` | ||
and `bz3_decode_block` APIs. | ||
|
||
### Chunk Structure | ||
|
||
```c | ||
// Chunk: a size prefix followed by the block it encapsulates | ||
struct Chunk { | ||
u32_le compressedSize; // Size of compressed block | ||
u32_le origSize; // Original uncompressed size | ||
if (origSize < 64) { | ||
SmallBlock block; | ||
} else { | ||
Block block; | ||
} | ||
}; | ||
``` | ||
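The chunk layout can be illustrated with a short sketch that walks the chunks following a header and classifies each block by `origSize`. It assumes that chunks are packed back to back until the end of the buffer and that `compressedSize` counts only the block bytes, not the 8-byte chunk prefix (which is what the `parent.compressedSize - 8` expression in the small block layout implies). The names `get_u32_le`, `chunk_stats_t`, and `walk_chunks` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load a little-endian 32-bit value. */
static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

typedef struct {
    size_t small_blocks;   /* chunks with origSize < 64  */
    size_t regular_blocks; /* chunks with origSize >= 64 */
} chunk_stats_t;

/*
 * Hypothetical sketch: walk the chunks in `buf` (the bytes after the
 * header) and count small and regular blocks. Returns 0 on success, or
 * -1 if a chunk is truncated or too small to hold the 8 bytes that
 * start every block.
 */
static int walk_chunks(const uint8_t *buf, size_t len, chunk_stats_t *stats) {
    size_t pos = 0;
    stats->small_blocks = 0;
    stats->regular_blocks = 0;

    while (pos < len) {
        if (len - pos < 8)
            return -1; /* truncated chunk prefix */

        uint32_t compressed_size = get_u32_le(buf + pos);
        uint32_t orig_size = get_u32_le(buf + pos + 4);
        pos += 8;

        /* Both block kinds begin with crc32 plus a second 32-bit field. */
        if (compressed_size < 8 || compressed_size > len - pos)
            return -1;

        if (orig_size < 64)
            stats->small_blocks++;
        else
            stats->regular_blocks++;

        pos += compressed_size; /* skip over the block itself */
    }
    return 0;
}
```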
|
||
### Small Block Format (< 64 bytes) | ||
|
||
For blocks smaller than 64 bytes, no compression is attempted. The data is stored with just a checksum: | ||
|
||
```c | ||
struct SmallBlock { | ||
u32_le crc32; // CRC32 checksum | ||
u32_le literal; // Always 0xFFFFFFFF for small blocks; an invalid `bwtIndex` that marks the data as stored | ||
u8 data[parent.compressedSize - 8]; // Uncompressed data | ||
}; | ||
``` | ||
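A minimal sketch of reading a small block, under the layout above: it checks the `0xFFFFFFFF` marker, copies the stored bytes, and hands back the recorded checksum. Verifying the checksum is deliberately left to the caller, since the CRC32 variant is not specified in this section. The names `get_u32_le` and `read_small_block` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load a little-endian 32-bit value. */
static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical sketch: extract the payload of a small block.
 * `block` points at the crc32 field and `compressed_size` is the
 * enclosing chunk's compressedSize. The payload is copied into `out`
 * (at least 63 bytes) and the recorded checksum is stored in `*crc_out`
 * without being verified.
 * Returns the payload length, or -1 if this is not a well-formed
 * small block.
 */
static int read_small_block(const uint8_t *block, uint32_t compressed_size,
                            uint8_t *out, uint32_t *crc_out) {
    /* 8 header bytes plus fewer than 64 bytes of stored data. */
    if (compressed_size < 8 || compressed_size - 8 >= 64)
        return -1;

    /* The second field must be the 0xFFFFFFFF marker. */
    if (get_u32_le(block + 4) != 0xFFFFFFFFu)
        return -1;

    *crc_out = get_u32_le(block);
    memcpy(out, block + 8, compressed_size - 8);
    return (int)(compressed_size - 8);
}
```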
|
||
### Regular Block Format (≥ 64 bytes) | ||
|
||
Larger blocks use a more complex format that supports multiple compression features: | ||
|
||
```c | ||
struct Block { | ||
u32_le crc32; // CRC32 checksum of uncompressed data | ||
u32_le bwtIndex; // Burrows-Wheeler transform index | ||
u8 model; // Compression model flags | ||
|
||
if ((model & 0x02) != 0) | ||
u32_le lzpSize; // Size after LZP compression | ||
if ((model & 0x04) != 0) | ||
u32_le rleSize; // Size after RLE compression | ||
u8 data[parent.compressedSize - (popcnt(model) * 4 + 9)]; | ||
}; | ||
``` | ||
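The variable-length header of a regular block can be decoded as in the sketch below, which reads the fixed 9 bytes, then the optional `lzpSize` and `rleSize` fields, and finally derives the payload length. It assumes only the two documented `model` bits (`0x02` and `0x04`) are set, in which case the result matches the `popcnt(model) * 4 + 9` expression. The names `get_u32_le`, `block_header_t`, and `parse_block_header` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_LZP 0x02u /* lzpSize field is present */
#define MODEL_RLE 0x04u /* rleSize field is present */

/* Hypothetical helper: load a little-endian 32-bit value. */
static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

typedef struct {
    uint32_t crc32;
    uint32_t bwt_index;
    uint8_t model;
    uint32_t lzp_size;    /* meaningful only if model & MODEL_LZP */
    uint32_t rle_size;    /* meaningful only if model & MODEL_RLE */
    uint32_t data_offset; /* payload offset from the start of the block */
    uint32_t data_size;   /* payload length in bytes */
} block_header_t;

/*
 * Hypothetical sketch: parse the header of a regular block.
 * `compressed_size` is the enclosing chunk's compressedSize.
 * Returns false if the block is too short for the fields it declares.
 */
static bool parse_block_header(const uint8_t *block, uint32_t compressed_size,
                               block_header_t *h) {
    if (compressed_size < 9)
        return false;

    h->crc32 = get_u32_le(block);
    h->bwt_index = get_u32_le(block + 4);
    h->model = block[8];
    h->lzp_size = 0;
    h->rle_size = 0;

    uint32_t offset = 9;
    if (h->model & MODEL_LZP) {
        if (compressed_size - offset < 4)
            return false;
        h->lzp_size = get_u32_le(block + offset);
        offset += 4;
    }
    if (h->model & MODEL_RLE) {
        if (compressed_size - offset < 4)
            return false;
        h->rle_size = get_u32_le(block + offset);
        offset += 4;
    }

    h->data_offset = offset;
    h->data_size = compressed_size - offset;
    return true;
}
```

With both filters enabled (`model == 0x06`), the payload starts at offset 17; with only one, at offset 13; with none, at offset 9.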
|
||
#### Compression Model | ||
|
||
The `model` byte in regular blocks indicates which compression features were used: | ||
|
||
- `0x02`: LZP (Lempel-Ziv Prediction) filter | ||
- `0x04`: RLE (Run-Length Encoding) filter | ||
|
||
## External Resources | ||
|
||
- [BZip3 Pattern for ImHex](https://github.com/WerWolv/ImHex-Patterns/pull/329) |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# BZip3 Format Documentation | ||
|
||
BZip3 is a modern compression format designed for high compression ratios while maintaining | ||
reasonable decompression speeds. It targets a compression ratio and performance comparable to | ||
LZMA and BZip2, in contrast to faster Lempel-Ziv codecs such as ZStandard or LZ4, which usually | ||
achieve a worse compression ratio. | ||
|
||
This documentation covers the technical specifications of the BZip3 format. | ||
|
||
## Format Characteristics | ||
|
||
- Block-level compression (no streams) | ||
- Maximum block size ranges from 65 KiB to 511 MiB | ||
- Memory usage of roughly 6 × block size, for both compression and decompression | ||
- Little-endian encoding for integers | ||
- Embedded CRC32 checksums for data integrity | ||
- Combines LZP and RLE filters, followed by the Burrows-Wheeler transform and arithmetic coding | ||
coupled with a statistical predictor | ||
|
||
## Format Overview | ||
|
||
BZip3 uses two main top-level formats: | ||
|
||
1. **File Format**: The standard format used by the command-line tool | ||
2. **Frame Format**: Used by the high-level API functions `bz3_compress` and `bz3_decompress`. | ||
|
||
These formats are very similar: the frame format is a superset of the file format, adding a | ||
block count field to the header. | ||
|
||
See [bzip3_format.md](./bzip3_format.md) for more details. |