Skip to content

Commit

Permalink
Rewritten: BZip3 Format Explanation (#143)
Browse files Browse the repository at this point in the history
* Rewritten: BZip3 Format Explanation

* Added: Link to ImHex Pattern

* Added: Little Endian Suffix to Pseudostructs

* Bugfix: I accidentally swapped 'file' and 'frame' format around.

* Fixed: Used wrong field when defining used block type in size comparison

* update filter names

* improve phrasing in overview.md

---------

Co-authored-by: Kamila Szewczyk <27734421+kspalaiologos@users.noreply.github.com>
  • Loading branch information
Sewer56 and kspalaiologos authored Dec 14, 2024
1 parent cc01039 commit 972e669
Show file tree
Hide file tree
Showing 5 changed files with 210 additions and 39 deletions.
180 changes: 180 additions & 0 deletions doc/bzip3_format.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,180 @@
# BZip3 Format Specification

Version 1

## Headers

The File and Frame formats share a similar structure, differing only in whether they include a
block count field.

### File Header

```
+----------------+------------------+--------------------+
| Header | Chunk 1 | Chunk 2 |
| (9 bytes) | (variable size) | (variable size) |
+----------------+------------------+--------------------+
```

This is created by the CLI tool.

### Frame Header

```
+----------------+------------------+--------------------+
| Header | Chunk 1 | Chunk 2 |
| (13 bytes) | (variable size) | (variable size) |
+----------------+------------------+--------------------+
```

This is created/read by `bz3_compress` and `bz3_decompress`.

### Header Structure

| Field | Type | Description | File Header | Frame Header |
| -------------- | ------ | ------------------------------- | ----------- | ------------ |
| Signature | u8[5] | Fixed "BZ3v1" ASCII string |||
| Max Block Size | u32_le | Maximum decompressed block size |||
| Block Count | u32_le | Number of blocks in the stream |||

### Validation Rules

1. **Signature**: Must exactly match "BZ3v1"
2. **Max Block Size**:
- Minimum: 65KiB (66,560 bytes)
- Maximum: 511MiB (535,822,336 bytes)
3. **Block Count** (Frame Format only):
- Must match the actual number of blocks in the stream
- Should be greater than 0

### Example Parser

```c
typedef struct {
uint32_t max_block_size;
uint32_t block_count; // Frame Format only
} bzip3_header_t;

bool read_bzip3_header(FILE* fp, bzip3_header_t* header, bool is_frame_format) {
char signature[6] = {0};

// Read signature
if (fread(signature, 1, 5, fp) != 5)
return false;

if (strcmp(signature, "BZ3v1") != 0)
return false;

// Read max block size
uint8_t size_bytes[4];
if (fread(size_bytes, 1, 4, fp) != 4)
return false;

header->max_block_size = read_neutral_s32(size_bytes);

if (header->max_block_size < 65536 ||
header->max_block_size > 535822336)
return false;

// Read block count if Frame Format
if (is_frame_format) {
uint8_t count_bytes[4];
if (fread(count_bytes, 1, 4, fp) != 4)
return false;

header->block_count = read_neutral_s32(count_bytes);

if (header->block_count == 0)
return false;
}

return true;
}
```
The integers in BZip3 are written unaligned, in little endian format.
A portable implementation is below.
```c
// Reading a 32-bit integer
static s32 read_neutral_s32(u8 * data) {
return ((u32)data[0]) |
(((u32)data[1]) << 8) |
(((u32)data[2]) << 16) |
(((u32)data[3]) << 24);
}
// Writing a 32-bit integer
static void write_neutral_s32(u8 * data, s32 value) {
data[0] = value & 0xFF;
data[1] = (value >> 8) & 0xFF;
data[2] = (value >> 16) & 0xFF;
data[3] = (value >> 24) & 0xFF;
}
```

## Block Format

After the header, both File and Frame formats contain a sequence of blocks that follow the Block
Format specification. Each block is encapsulated in a chunk structure that defines its size.

The blocks (***without chunk header***) can be encoded/decoded using the `bz3_encode_block`
and `bz3_decode_block` APIs.

### Chunk Structure

```c
// Main block structure
struct Chunk {
u32_le compressedSize; // Size of compressed block
u32_le origSize; // Original uncompressed size
if (origSize < 64) {
SmallBlock block;
} else {
Block block;
}
};
```

### Small Block Format (< 64 bytes)

For blocks smaller than 64 bytes, no compression is attempted. The data is stored with just a checksum:

```c
struct SmallBlock {
u32_le crc32; // CRC32 checksum
u32_le literal; // Always 0xFFFFFFFF for small blocks. This is basically an invalid `bwtIndex`
u8 data[parent.compressedSize - 8]; // Uncompressed data
};
```

### Regular Block Format (≥ 64 bytes)

Larger blocks use a more complex format that supports multiple compression features:

```c
struct Block {
u32_le crc32; // CRC32 checksum of uncompressed data
u32_le bwtIndex; // Burrows-Wheeler transform index
u8 model; // Compression model flags

if ((model & 0x02) != 0)
u32_le lzpSize; // Size after LZP compression
if ((model & 0x04) != 0)
u32_le rleSize; // Size after RLE compression
u8 data[parent.compressedSize - (popcnt(model) * 4 + 9)];
};
```

#### Compression Model

The `model` byte in regular blocks indicates which compression features were used:

- `0x02`: LZP (Lempel Ziv Prediction) filter
- `0x04`: RLE (Run-Length Encoding) filter

## External Resources

- [BZip3 Pattern for ImHex](https://github.com/WerWolv/ImHex-Patterns/pull/329)
21 changes: 0 additions & 21 deletions doc/file_format.md

This file was deleted.

4 changes: 0 additions & 4 deletions doc/high_level_format.md

This file was deleted.

14 changes: 0 additions & 14 deletions doc/low_level_format.md

This file was deleted.

30 changes: 30 additions & 0 deletions doc/overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# BZip3 Format Documentation

BZip3 is a modern compression format designed for high compression ratios while maintaining
reasonable decompression speeds. It is intended to provide similar compression ratio and
performance to LZMA and BZip2; as opposed to faster Lempel-Ziv codecs that usually offer worse
compression ratio like ZStandard or LZ4.

This documentation covers the technical specifications of the BZip3 format.

## Format Characteristics

- Block level compression (no streams)
- Maximum block size ranges from 65KiB to 511MiB
- Memory usage of ~(6 x block size), both compression and decompression
- Little-endian encoding for integers
- Embedded CRC32 checksums for data integrity
- Combines LZP, RLE followed by Burrows-Wheeler transform and arithmetic coding coupled with
a statistical predictor.

## Format Overview

BZip3 uses two main top-level formats:

1. **File Format**: The standard format used by the command-line tool
2. **Frame Format**: Used by the high-level API functions `bz3_compress` and `bz3_decompress`.

These formats are very similar: the file format is a superset of the frame format and thus also
contains a block count field.

See [bzip3_format.md](./bzip3_format.md) for more details.

0 comments on commit 972e669

Please sign in to comment.