diff --git a/README.md b/README.md
index 0a5480bc..3e92fced 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,8 @@
- Pco logo: a pico-scale, compressed version of the Pyramid of Khafre in the palm of your hand
+ Pco logo: a pico-scale, compressed version of the Pyramid of Khafre in the palm of your hand
 [![crates.io][crates-badge]][crates-url]
@@ -48,20 +51,21 @@ numerical sequences with
 ## How is Pco so much better than alternatives?
 
 Pco is designed specifically for numerical data, whereas alternatives rely on
-general-purpose (LZ) compressors that were designed for string or binary data.
+general-purpose (LZ) compressors that target string or binary data.
 Pco uses a holistic, 3-step approach:
 
 * **modes**.
   Pco identifies an approximate structure of the numbers called a
-  mode and then applies it to all the numbers.
+  mode and then uses it to split numbers into "latents".
   As an example, if all numbers are approximately multiples of 777, int mult mode
-  decomposes each number `x` into latent variables `l_0` and
+  splits each number `x` into latent variables `l_0` and
   `l_1` such that `x = 777 * l_0 + l_1`.
   Most natural data uses classic mode, which simply matches `x = l_0`.
 * **delta encoding**.
   Pco identifies whether certain latent variables would be better compressed as
-  consecutive deltas (or deltas of deltas, or so forth).
-  If so, it takes consecutive differences.
+  deltas between consecutive elements (or deltas of deltas, or deltas with
+  lookback).
+  If so, it takes differences.
 * **binning**.
   This is the heart and most novel part of Pco.
   Pco represents each (delta-encoded) latent variable as an approximate,
@@ -79,11 +83,11 @@ entropy.
 
 ### Wrapped or Standalone
 
-Pco is designed to be easily wrapped into another format.
+Pco is designed to embed into wrapping formats.
 It provides a powerful wrapped API with the building blocks to interleave it
 with the wrapping format.
 This is useful if the wrapping format needs to support things like nullability,
-multiple columns, random access or seeking.
+multiple columns, random access, or seeking.
 
 The standalone format is a minimal implementation of a wrapped format.
 It supports batched decompression only with no other niceties.
@@ -102,24 +106,19 @@ multiple chunks per file.
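For intuition, the int mult split described above amounts to integer division and remainder. The following is a hypothetical sketch (the `split` and `join` names are invented here, not Pco's API):

```rust
// Hypothetical sketch of int mult mode with mult = 777: each number splits
// into a multiplier latent l_0 and a remainder latent l_1.
fn split(x: u64, mult: u64) -> (u64, u64) {
    // l_0 and l_1 such that x = mult * l_0 + l_1
    (x / mult, x % mult)
}

fn join(l_0: u64, l_1: u64, mult: u64) -> u64 {
    mult * l_0 + l_1
}

fn main() {
    let (l_0, l_1) = split(1555, 777);
    assert_eq!((l_0, l_1), (2, 1)); // 1555 = 777 * 2 + 1
    assert_eq!(join(l_0, l_1, 777), 1555);
}
```

Because most of the information now lives in the small remainders `l_1`, both latent streams compress far better than the raw numbers would.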
 ### Mistakes to Avoid
 
-You will get disappointing results from Pco if your data:
+You may get disappointing results from Pco if your data in a single chunk
 
-* combines semantically different sequences into a single chunk, or
-* contains fewer numbers per chunk or page than recommended (see above table).
+* combines semantically different sequences, or
+* is inherently 2D or higher.
 
-Example: the NYC taxi dataset has `f64` columns for `passenger_base_fare` and
-`tolls`.
-Suppose we assign these as `fare[0...n]` and `tolls[0...n]` respectively, where
+Example: the NYC taxi dataset has `f64` columns for `fare` and
+`trip_miles`.
+Suppose we assign these as `fare[0...n]` and `trip_miles[0...n]` respectively, where
 `n=50,000`.
 
 * separate chunk for each column => good compression
-* single chunk `fare[0], ... fare[n-1], toll[0], ... toll[n-1]` => mediocre
-  compression
-* single chunk `fare[0], toll[0], ... fare[n-1], toll[n-1]` => poor compression
-
-Similarly, we could compress images by making a separate chunk for each
-flattened channel (red, green, blue).
-Though dedicated formats like webp likely compress natural images better.
+* single chunk `fare[0], ... fare[n-1], trip_miles[0], ... trip_miles[n-1]` => bad compression
+* single chunk `fare[0], trip_miles[0], ... fare[n-1], trip_miles[n-1]` => bad compression
 
 ## Extra
diff --git a/docs/format.md b/docs/format.md
index 8bf07bc3..afb6dd71 100644
--- a/docs/format.md
+++ b/docs/format.md
@@ -8,18 +8,24 @@
 Bit packing a component is completed by filling the rest of the byte with 0s.
 
 Let `dtype_size` be the data type's number of bits.
 A "raw" value for a number is a `dtype_size`-bit value that maps to the number
-via [its `from_unsigned` function](#numbers---latents).
+via [its `from_unsigned` function](#modes).
 
 ## Wrapped Format Components
 
-Pco wrapped format diagram
-
 The wrapped format consists of 3 components: header, chunk metadata, and data
 pages.
 Wrapping formats may encode these components any place they wish.
 
 Pco is designed to have one header per file, possibly multiple chunks per
-header, and possibly multiple data pages per chunk.
+header, and possibly multiple pages per chunk.
+
+[Plate notation](https://en.wikipedia.org/wiki/Plate_notation) for chunk
+metadata component:
+
+Pco wrapped chunk meta plate notation
+
+Plate notation for page component:
+
+Pco wrapped page plate notation
 
 ### Header
 
@@ -34,11 +40,12 @@ The header simply consists of
 
 So far, these format versions exist:
 
-| format version | first Rust version | deviations from next format version           |
-|----------------|--------------------|-----------------------------------------------|
-| 0              | 0.0.0              | int mult mode unsupported                     |
-| 1              | 0.1.0              | float quant mode and 16-bit types unsupported |
-| 2              | 0.3.0              | -                                             |
+| format version | first Rust version | deviations from next format version          |
+|----------------|--------------------|----------------------------------------------|
+| 0              | 0.0.0              | IntMult mode unsupported                     |
+| 1              | 0.1.0              | FloatQuant mode and 16-bit types unsupported |
+| 2              | 0.3.0              | delta variants and Lookback unsupported      |
+| 3              | 0.4.0              | -                                            |
 
 ### Chunk Metadata
 
@@ -47,22 +54,41 @@
 metadata is out of range.
 For example, if the sum of bin weights does not equal the tANS size; or if a
 bin's offset bits exceed the data type size.
 
-Each chunk metadata consists of
+Each chunk meta consists of
 
 * [4 bits] `mode`, using this table:
 
-  | value | mode        | n latent variables | 2nd latent uses delta? | `extra_mode_bits` |
-  |-------|-------------|--------------------|------------------------|-------------------|
-  | 0     | classic     | 1                  |                        | 0                 |
-  | 1     | int mult    | 2                  | no                     | `dtype_size`      |
-  | 2     | float mult  | 2                  | no                     | `dtype_size`      |
-  | 3     | float quant | 2                  | no                     | 8                 |
-  | 4-15  | \           |                    |                        |                   |
+  | value | mode       | n latent variables | `extra_mode_bits` |
+  |-------|------------|--------------------|-------------------|
+  | 0     | Classic    | 1                  | 0                 |
+  | 1     | IntMult    | 2                  | `dtype_size`      |
+  | 2     | FloatMult  | 2                  | `dtype_size`      |
+  | 3     | FloatQuant | 2                  | 8                 |
+  | 4-15  | \          |                    |                   |
+
 * [`extra_mode_bits` bits] for certain modes, extra data is parsed.
   See the mode-specific formulas below for how this is used, e.g. as the
   `mult` or `k` values.
-* [3 bits] the delta encoding order `delta_order`.
-* per latent variable,
+* [4 bits] `delta_encoding`, using this table:
+
+  | value | delta encoding | n latent variables | `extra_delta_bits` |
+  |-------|----------------|--------------------|--------------------|
+  | 0     | None           | 0                  | 0                  |
+  | 1     | Consecutive    | 0                  | 4                  |
+  | 2     | Lookback       | 1                  | 10                 |
+  | 3-15  | \              |                    |                    |
+
+* [`extra_delta_bits` bits]
+  * for `Consecutive`, this is 3 bits for `order` from 1-7, and 1 bit for
+    whether the mode's secondary latent is delta encoded.
+    An order of 0 is considered a corruption.
+    Let `state_n = order`.
+  * for `Lookback`, this is 5 bits for `window_n_log - 1`, 4 bits for
+    `state_n_log`, and 1 bit for whether the mode's secondary latent is delta
+    encoded.
+    Let `state_n = 1 << state_n_log`.
+* per latent variable (ordered by delta latent variables followed by mode
+  latent variables),
   * [4 bits] `ans_size_log`, the log2 of the size of its tANS table.
     This may not exceed 14.
   * [15 bits] the count of bins
@@ -77,17 +103,17 @@
 
 Based on chunk metadata, 4-way interleaved tANS decoders should be initialized
 using [the simple `spread_state_tokens` algorithm from this
 repo](../pco/src/ans/spec.rs).
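The `state_n` rules above can be summarized with a small sketch. This is illustrative only — the enum and function names are invented here and do not match pco's internal types:

```rust
// Hypothetical representation of the parsed `delta_encoding` field
// (names are illustrative; this is not pco's internal representation).
enum DeltaEncoding {
    None,
    Consecutive { order: u32, secondary_delta: bool },
    Lookback { window_n_log: u32, state_n_log: u32, secondary_delta: bool },
}

// number of delta state elements stored per latent variable in each page
fn state_n(encoding: &DeltaEncoding) -> usize {
    match encoding {
        DeltaEncoding::None => 0,
        // Consecutive keeps one delta moment per order
        DeltaEncoding::Consecutive { order, .. } => *order as usize,
        // Lookback keeps a power-of-2 number of state elements
        DeltaEncoding::Lookback { state_n_log, .. } => 1usize << *state_n_log,
    }
}

fn main() {
    let c = DeltaEncoding::Consecutive { order: 2, secondary_delta: false };
    assert_eq!(state_n(&c), 2);
    let lb = DeltaEncoding::Lookback { window_n_log: 8, state_n_log: 4, secondary_delta: true };
    assert_eq!(state_n(&lb), 16);
}
```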
-### Data Page
+### Page
 
-If there are `n` numbers in a data page, it will consist of `ceil(n / 256)`
+If there are `n` numbers in a page, it will consist of `ceil(n / 256)`
 batches.
 All but the final batch will contain 256 numbers, and the final batch will
 contain the rest (<= 256 numbers).
 
-Each data page consists of
+Each page consists of
 
 * per latent variable,
-  * if delta encoding is applicable, for `i in 0..delta_order`,
-    * [`dtype_size` bits] the `i`th delta moment
+  * if delta encoding is applicable, for `i in 0..state_n`,
+    * [`dtype_size` bits] the `i`th delta state
   * for `i in 0..4`,
     * [`ans_size_log` bits] the `i`th interleaved tANS state index
   * [0-7 bits] 0s until byte-aligned
@@ -117,31 +143,61 @@
 It consists of
 
 * [8 bits] a byte for the data type
 * [24 bits] 1 less than `chunk_n`, the count of numbers in the chunk
 * a wrapped chunk metadata
-* a wrapped data page of `chunk_n` numbers
+* a wrapped page of `chunk_n` numbers
 * [8 bits] a magic termination byte (0).
 
 ## Processing Formulas
 
 Pco compression and decompression steps
 
-### Numbers <-> Latents
+Decompression steps for a batch, in order:
+
+### Bin Indices and Offsets -> Latents
+
+To produce latents, we simply do `l[i] = bin[i].lower + offset[i]`.
+
+### Delta Encodings
+
+Depending on `delta_encoding`, the mode latents are further decoded.
+Note that the delta latent variable, if it exists, is never delta encoded
+itself.
+
+#### None
+
+No additional processing is applied.
+
+#### Consecutive
+
+Latents are decoded by taking a cumulative sum repeatedly.
+The delta state is interpreted as delta moments, which are used to initialize
+each cumulative sum, and get modified for the next batch.
-Based on the mode, unsigneds are decomposed into latents.
+For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
+and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.
+
+#### Lookback
+
+Let `lookback` be the delta latent variable.
+Mode latents are decoded via `l[i] += l[i - lookback[i]]`.
+
+### Modes
+
+Based on the mode, latents are joined into the finalized numbers.
 Let `l0` and `l1` be the primary and secondary latents respectively.
 Let `MID` be the middle value for the latent type (e.g. 2^31 for `u32`).
 
-| mode        | decoding formula                                                       |
-|-------------|------------------------------------------------------------------------|
-| classic     | `from_latent_ordered(l0)`                                              |
-| int mult    | `from_latent_ordered(l0 * mult + l1)`                                  |
-| float mult  | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs`                   |
-| float quant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1)` |
+| mode       | decoding formula                                                        |
+|------------|-------------------------------------------------------------------------|
+| Classic    | `from_latent_ordered(l0)`                                               |
+| IntMult    | `from_latent_ordered(l0 * mult + l1)`                                   |
+| FloatMult  | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs`                    |
+| FloatQuant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1))` |
 
 Here ULP refers to [unit in the last place](https://en.wikipedia.org/wiki/Unit_in_the_last_place).
 
 Each data type has an order-preserving bijection to an unsigned data type.
 For instance, floats have their first bit toggled, and the rest of their bits
-bits toggled if the float was originally negative:
+toggled if the float was originally negative:
 
 ```rust
 fn from_unsigned(unsigned: u32) -> f32 {
@@ -163,28 +219,3 @@ fn from_unsigned(unsigned: u32) -> i32 {
   i32::MIN.wrapping_add(unsigned as i32)
 }
 ```
-
-### Latents <-> Deltas
-
-Latents are converted to deltas by taking consecutive differences
-`delta_order` times, and decoded by taking a cumulative sum repeatedly.
-Delta moments are emitted during encoding and consumed during decoding to
-initialize the cumulative sum.
-
-For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
-and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.
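The 2nd order consecutive example above can be reproduced with a short sketch. This illustrates the decoding rule only — `decode_order2` is an invented name, not pco's implementation:

```rust
// Illustrative decoder for 2nd order consecutive delta encoding: the two
// delta moments seed the cumulative sums, and each delta adjusts the
// running slope (first difference).
fn decode_order2(moments: [i64; 2], deltas: &[i64]) -> Vec<i64> {
    let (mut value, mut slope) = (moments[0], moments[1]);
    let mut latents = vec![value];
    value += slope;
    latents.push(value);
    for &second_diff in deltas {
        slope += second_diff;
        value += slope;
        latents.push(value);
    }
    latents
}

fn main() {
    // matches the worked example: moments [1, 2] and deltas [0, 10, 0]
    assert_eq!(decode_order2([1, 2], &[0, 10, 0]), vec![1, 3, 5, 17, 29]);
}
```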
-
-### Deltas <-> Bin Indices and Offsets
-
-To dissect the deltas, we find the bin that contains each delta `x` and compute
-its offset as `x - bin.lower`.
-For instance, suppose we have these bins, where we have computed the upper bound
-for convenience:
-
-| bin idx | lower | offset bits | upper (inclusive) |
-|---------|-------|-------------|-------------------|
-| 0       | 7     | 2           | 10                |
-| 1       | 10    | 3           | 17                |
-
-Then 8 would be in bin 0 with offset 1, and 15 would be in bin 1 with offset 5.
-10 could be encoded either as bin 0 with offset 3 or bin 1 with offset 0.
diff --git a/dtype_dispatch/README.md b/dtype_dispatch/README.md
index 967d52e4..a9b15ab7 100644
--- a/dtype_dispatch/README.md
+++ b/dtype_dispatch/README.md
@@ -5,8 +5,54 @@
 This is a common problem in numerical libraries (think numpy, torch, polars):
 you have a variety of data types and data structures to hold them, but every
 function involves matching an enum or converting from a generic to an enum.
 
-Consider this simple API of a hypothetical numerical library supporting the
-`length` and `add` functions on arrays, plus `new` and `downcast`:
+Example with `i32` and `f32` data types for dynamically-typed vectors,
+supporting `.length()` and `.add(other)` operations, plus generic
+`new` and `downcast` functions:
+
+```rust
+pub trait Dtype: 'static {}
+impl Dtype for i32 {}
+impl Dtype for f32 {}
+
+// register our two macros, `define_an_enum` and `match_an_enum`, constrained
+// to the `Dtype` trait, with our variant => type mapping:
+dtype_dispatch::build_dtype_macros!(
+  define_an_enum,
+  match_an_enum,
+  Dtype,
+  {
+    I32 => i32,
+    F32 => f32,
+  },
+);
+
+// define any enum holding a Vec of any data type!
+define_an_enum!(
+  #[derive(Clone, Debug)]
+  DynArray(Vec)
+);
+
+impl DynArray {
+  pub fn length(&self) -> usize {
+    match_an_enum!(self, DynArray<T>(inner) => { inner.len() })
+  }
+
+  pub fn add(&self, other: &DynArray) -> DynArray {
+    match_an_enum!(self, DynArray<T>(inner) => {
+      let other_inner = other.downcast_ref::<T>().unwrap();
+      let added = inner.iter().zip(other_inner).map(|(a, b)| a + b).collect::<Vec<T>>();
+      DynArray::new(added).unwrap()
+    })
+  }
+}
+
+// we could also use `DynArray::I32()` here, but just to show we can convert generics:
+let x_dynamic = DynArray::new(vec![1_i32, 2, 3]).unwrap();
+let x_doubled_generic = x_dynamic.add(&x_dynamic).downcast::<i32>().unwrap();
+assert_eq!(x_doubled_generic, vec![2, 4, 6]);
+```
+
+Compare this with the same API written manually:
 
 ```rust
 use std::{any, mem};
@@ -88,48 +134,6 @@ powerful macros for you to use.
 These building blocks can solve almost any dynamic<->generic data type
 dispatch problem:
 
-```rust
-pub trait Dtype: 'static {}
-impl Dtype for i32 {}
-impl Dtype for f32 {}
-
-// register our two macros, `define_an_enum` and `match_an_enum`, constrained
-// to the `Dtype` trait, with our variant => type mapping:
-dtype_dispatch::build_dtype_macros!(
-  define_an_enum,
-  match_an_enum,
-  Dtype,
-  {
-    I32 => i32,
-    F32 => f32,
-  },
-);
-
-// define any enum for any `Vec` of a data type!
-define_an_enum!(
-  #[derive(Clone, Debug)]
-  DynArray(Vec)
-);
-
-impl DynArray {
-  pub fn length(&self) -> usize {
-    match_an_enum!(self, DynArray<T>(inner) => { inner.len() })
-  }
-
-  pub fn add(&self, other: &DynArray) -> DynArray {
-    match_an_enum!(self, DynArray<T>(inner) => {
-      let other_inner = other.downcast_ref::<T>().unwrap();
-      let added = inner.iter().zip(other_inner).map(|(a, b)| a + b).collect::<Vec<T>>();
-      DynArray::new(added).unwrap()
-    })
-  }
-}
-
-// we could also use `DynArray::I32()` here, but just to show we can convert generics:
-let x_dynamic = DynArray::new(vec![1_i32, 2, 3]).unwrap();
-let x_doubled_generic = x_dynamic.add(&x_dynamic).downcast::<i32>().unwrap();
-assert_eq!(x_doubled_generic, vec![2, 4, 6]);
-```
 
 ## Comparisons
 
@@ -150,6 +154,9 @@ which is annoyingly restrictive.
 For instance, traits with generic associated functions can't be put in a
 `Box`.
 
+All enums are `#[non_exhaustive]` by default, but the generated matching
+macros handle wildcard cases and can be used safely in downstream crates.
+
 ## Limitations
 
 At present, enum and container type names must always be a single identifier.
@@ -157,5 +164,6 @@
 For instance, `Vec` will work, but `std::vec::Vec` and `Vec<u32>` will not.
 You can satisfy this by `use`ing your type or making a type alias of it,
 e.g. `type MyContainer<T> = Vec<Box<T>>`.
 
-It is also mandatory that you place exactly one attribute on each enum, e.g.
-with a `#[derive(Clone, Debug)]`.
+It is also mandatory that you place exactly one attribute when defining each
+enum, e.g. with a `#[derive(Clone, Debug)]`.
+If you don't want any attributes, you can just do `#[derive()]`.
diff --git a/images/wrapped_chunk_meta_plate.svg b/images/wrapped_chunk_meta_plate.svg
new file mode 100644
index 00000000..e13f9726
--- /dev/null
+++ b/images/wrapped_chunk_meta_plate.svg
@@ -0,0 +1,63 @@
+(SVG markup omitted: plate-notation diagram for the wrapped chunk metadata component)
diff --git a/images/wrapped_format.svg b/images/wrapped_format.svg
deleted file mode 100644
index bcf0a7c1..00000000
--- a/images/wrapped_format.svg
+++ /dev/null
@@ -1,208 +0,0 @@
-(SVG markup omitted: the old wrapped-format diagram)
diff --git a/images/wrapped_page_plate.svg b/images/wrapped_page_plate.svg
new file mode 100644
index 00000000..e2213fd3
--- /dev/null
+++ b/images/wrapped_page_plate.svg
@@ -0,0 +1,57 @@
+(SVG markup omitted: plate-notation diagram for the wrapped page component)