[![crates.io][crates-badge]][crates-url]
@@ -48,20 +51,21 @@ numerical sequences with
## How is Pco so much better than alternatives?
Pco is designed specifically for numerical data, whereas alternatives rely on
-general-purpose (LZ) compressors that were designed for string or binary data.
+general-purpose (LZ) compressors that target string or binary data.
Pco uses a holistic, 3-step approach:
* **modes**.
Pco identifies an approximate structure of the numbers called a
- mode and then applies it to all the numbers.
+ mode and then uses it to split numbers into "latents".
As an example, if all numbers are approximately multiples of 777, int mult mode
- decomposes each number `x` into latent variables `l_0` and
+ splits each number `x` into latent variables `l_0` and
`l_1` such that `x = 777 * l_0 + l_1`.
Most natural data uses classic mode, which simply matches `x = l_0`.
* **delta encoding**.
Pco identifies whether certain latent variables would be better compressed as
- consecutive deltas (or deltas of deltas, or so forth).
- If so, it takes consecutive differences.
+ deltas between consecutive elements (or deltas of deltas, or deltas with
+ lookback).
+ If so, it takes differences.
* **binning**.
This is the heart and most novel part of Pco.
Pco represents each (delta-encoded) latent variable as an approximate,
@@ -79,11 +83,11 @@ entropy.
### Wrapped or Standalone
-Pco is designed to be easily wrapped into another format.
+Pco is designed to embed into wrapping formats.
It provides a powerful wrapped API with the building blocks to interleave it
with the wrapping format.
This is useful if the wrapping format needs to support things like nullability,
-multiple columns, random access or seeking.
+multiple columns, random access, or seeking.
The standalone format is a minimal implementation of a wrapped format.
It supports batched decompression only with no other niceties.
@@ -102,24 +106,19 @@ multiple chunks per file.
### Mistakes to Avoid
-You will get disappointing results from Pco if your data:
+You may get disappointing results from Pco if your data in a single chunk
-* combines semantically different sequences into a single chunk, or
-* contains fewer numbers per chunk or page than recommended (see above table).
+* combines semantically different sequences, or
+* is inherently 2D or higher.
-Example: the NYC taxi dataset has `f64` columns for `passenger_base_fare` and
-`tolls`.
-Suppose we assign these as `fare[0...n]` and `tolls[0...n]` respectively, where
+Example: the NYC taxi dataset has `f64` columns for `fare` and
+`trip_miles`.
+Suppose we assign these as `fare[0...n]` and `trip_miles[0...n]` respectively, where
`n=50,000`.
* separate chunk for each column => good compression
-* single chunk `fare[0], ... fare[n-1], toll[0], ... toll[n-1]` => mediocre
- compression
-* single chunk `fare[0], toll[0], ... fare[n-1], toll[n-1]` => poor compression
-
-Similarly, we could compress images by making a separate chunk for each
-flattened channel (red, green, blue).
-Though dedicated formats like webp likely compress natural images better.
+* single chunk `fare[0], ... fare[n-1], trip_miles[0], ... trip_miles[n-1]` => bad compression
+* single chunk `fare[0], trip_miles[0], ... fare[n-1], trip_miles[n-1]` => bad compression
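
A rough intuition for the interleaving problem, sketched with made-up values (not real taxi data): a smooth column has small consecutive deltas, while interleaving two unrelated columns forces every delta to jump between distributions.

```rust
fn max_abs_delta(xs: &[i64]) -> i64 {
    xs.windows(2).map(|w| (w[1] - w[0]).abs()).max().unwrap()
}

fn main() {
    // synthetic "fare"-like and "miles"-like columns (assumed values)
    let fares: Vec<i64> = (0..8).map(|i| 1000 + 3 * i).collect(); // smooth, slowly rising
    let miles: Vec<i64> = (0..8).map(|i| 5 + i % 2).collect(); // small, noisy

    // chunk per column: deltas stay small within each column
    let per_column = max_abs_delta(&fares).max(max_abs_delta(&miles));

    // single interleaved chunk: every delta oscillates between the two distributions
    let interleaved: Vec<i64> = fares.iter().zip(&miles).flat_map(|(&f, &m)| [f, m]).collect();
    let mixed = max_abs_delta(&interleaved);

    assert!(mixed > 100 * per_column);
    println!("per-column max delta: {per_column}, interleaved: {mixed}");
}
```

Loosely speaking, the wider the range of deltas, the more offset bits binning must spend per number.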
## Extra
diff --git a/docs/format.md b/docs/format.md
index 8bf07bc3..afb6dd71 100644
--- a/docs/format.md
+++ b/docs/format.md
@@ -8,18 +8,24 @@ Bit packing a component is completed by filling the rest of the byte with 0s.
Let `dtype_size` be the data type's number of bits.
A "raw" value for a number is a `dtype_size`-bit value that maps to the number
-via [its `from_unsigned` function](#numbers---latents).
+via [its `from_unsigned` function](#Modes).
## Wrapped Format Components
-
-
The wrapped format consists of 3 components: header, chunk metadata, and data
pages.
Wrapping formats may encode these components any place they wish.
-
Pco is designed to have one header per file, possibly multiple chunks per
-header, and possibly multiple data pages per chunk.
+header, and possibly multiple pages per chunk.
+
+[Plate notation](https://en.wikipedia.org/wiki/Plate_notation) for chunk
+metadata component:
+
+
+
+Plate notation for page component:
+
+
### Header
@@ -34,11 +40,12 @@ The header simply consists of
So far, these format versions exist:
-| format version | first Rust version | deviations from next format version |
-|----------------|--------------------|-----------------------------------------------|
-| 0 | 0.0.0 | int mult mode unsupported |
-| 1 | 0.1.0 | float quant mode and 16-bit types unsupported |
-| 2 | 0.3.0 | - |
+| format version | first Rust version | deviations from next format version |
+|----------------|--------------------|----------------------------------------------|
+| 0 | 0.0.0 | IntMult mode unsupported |
+| 1 | 0.1.0 | FloatQuant mode and 16-bit types unsupported |
+| 2 | 0.3.0 | delta variants and Lookback unsupported |
+| 3 | 0.4.0 | - |
### Chunk Metadata
@@ -47,22 +54,41 @@ metadata is out of range.
For example, if the sum of bin weights does not equal the tANS size; or if a
bin's offset bits exceed the data type size.
-Each chunk metadata consists of
+Each chunk meta consists of
* [4 bits] `mode`, using this table:
- | value | mode | n latent variables | 2nd latent uses delta? | `extra_mode_bits` |
- |-------|--------------|--------------------|------------------------|-------------------|
- | 0 | classic | 1 | | 0 |
- | 1 | int mult | 2 | no | `dtype_size` |
- | 2 | float mult | 2 | no | `dtype_size` |
- | 3 | float quant | 2 | no | 8 |
- | 4-15 | \ | | | |
+ | value | mode | n latent variables | `extra_mode_bits` |
+ |-------|--------------|--------------------|-------------------|
+ | 0 | Classic | 1 | 0 |
+ | 1 | IntMult | 2 | `dtype_size` |
+ | 2 | FloatMult | 2 | `dtype_size` |
+ | 3 | FloatQuant | 2 | 8 |
+ | 4-15 | \ | | |
+
* [`extra_mode_bits` bits] for certain modes, extra data is parsed. See the
mode-specific formulas below for how this is used, e.g. as the `mult` or `k`
values.
-* [3 bits] the delta encoding order `delta_order`.
-* per latent variable,
+* [4 bits] `delta_encoding`, using this table:
+
+  | value | delta encoding | n latent variables | `extra_delta_bits` |
+  |-------|----------------|--------------------|--------------------|
+  | 0     | None           | 0                  | 0                  |
+  | 1     | Consecutive    | 0                  | 4                  |
+  | 2     | Lookback       | 1                  | 10                 |
+  | 3-15  | \              |                    |                    |
+
+* [`extra_delta_bits` bits]
+  * for `Consecutive`, this is 3 bits for `order` from 1-7, and 1 bit for
+ whether the mode's secondary latent is delta encoded.
+ An order of 0 is considered a corruption.
+ Let `state_n = order`.
+  * for `Lookback`, this is 5 bits for `window_n_log - 1`, 4 bits for
+ `state_n_log`, and 1 for whether the mode's secondary latent is delta
+ encoded.
+ Let `state_n = 1 << state_n_log`.
+* per latent variable (ordered by delta latent variables followed by mode
+ latent variables),
* [4 bits] `ans_size_log`, the log2 of the size of its tANS table.
This may not exceed 14.
* [15 bits] the count of bins
@@ -77,17 +103,17 @@ Based on chunk metadata, 4-way interleaved tANS decoders should be initialized
using
[the simple `spread_state_tokens` algorithm from this repo](../pco/src/ans/spec.rs).
-### Data Page
+### Page
-If there are `n` numbers in a data page, it will consist of `ceil(n / 256)`
+If there are `n` numbers in a page, it will consist of `ceil(n / 256)`
batches. All but the final batch will contain 256 numbers, and the final
batch will contain the rest (<= 256 numbers).
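
For instance, the batch sizes for a page can be computed as follows (a hypothetical helper for illustration, not part of the Pco API):

```rust
// batch sizes for a page of n numbers: full 256-number batches, then the rest
fn batch_sizes(n: usize) -> Vec<usize> {
    let full = n / 256;
    let mut sizes = vec![256; full];
    if n % 256 != 0 {
        sizes.push(n % 256);
    }
    sizes
}

fn main() {
    assert_eq!(batch_sizes(600), vec![256, 256, 88]);
    assert_eq!(batch_sizes(512), vec![256, 256]);
    assert_eq!(batch_sizes(10), vec![10]);
    println!("ok");
}
```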
-Each data page consists of
+Each page consists of
* per latent variable,
- * if delta encoding is applicable, for `i in 0..delta_order`,
- * [`dtype_size` bits] the `i`th delta moment
+ * if delta encoding is applicable, for `i in 0..state_n`,
+ * [`dtype_size` bits] the `i`th delta state
* for `i in 0..4`,
* [`ans_size_log` bits] the `i`th interleaved tANS state index
* [0-7 bits] 0s until byte-aligned
@@ -117,31 +143,61 @@ It consists of
* [8 bits] a byte for the data type
* [24 bits] 1 less than `chunk_n`, the count of numbers in the chunk
* a wrapped chunk metadata
- * a wrapped data page of `chunk_n` numbers
+ * a wrapped page of `chunk_n` numbers
* [8 bits] a magic termination byte (0).
## Processing Formulas
-### Numbers <-> Latents
+These formulas are listed in the order they are applied when decompressing a
+batch:
+
+### Bin Indices and Offsets -> Latents
+
+To produce latents, we simply do `l[i] = bin[i].lower + offset[i]`.
+
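A minimal sketch of this step, using toy bins with assumed lowers of 7 and 10:

```rust
// a toy bin, mirroring the `lower` field described in this document
struct Bin {
    lower: u32,
}

// reconstruct latents as l[i] = bin[i].lower + offset[i]
fn decode_latents(bins: &[Bin], bin_idx: &[usize], offsets: &[u32]) -> Vec<u32> {
    bin_idx
        .iter()
        .zip(offsets)
        .map(|(&b, &off)| bins[b].lower + off)
        .collect()
}

fn main() {
    let bins = [Bin { lower: 7 }, Bin { lower: 10 }];
    // bin 0 with offset 1 => 8; bin 1 with offset 5 => 15
    let latents = decode_latents(&bins, &[0, 1], &[1, 5]);
    assert_eq!(latents, vec![8, 15]);
    println!("{latents:?}");
}
```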
+### Delta Encodings
+
+Depending on `delta_encoding`, the mode latents are further decoded.
+Note that the delta latent variable, if it exists, is never delta encoded
+itself.
+
+#### None
+
+No additional processing is applied.
+
+#### Consecutive
+
+Latents are decoded by taking a cumulative sum repeatedly.
+The delta state is interpreted as delta moments, which are used to initialize
+each cumulative sum, and get modified for the next batch.
-Based on the mode, unsigneds are decomposed into latents.
+For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
+and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.
+
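This decoding can be sketched as follows (a simplified, single-batch version; the real decoder also carries the updated moments into the next batch):

```rust
// consecutive delta decoding: one cumulative sum per order, each
// initialized from a delta moment (innermost moment first when reversed)
fn decode_consecutive(moments: &[u64], deltas: &[u64]) -> Vec<u64> {
    let mut latents = deltas.to_vec();
    for &moment in moments.iter().rev() {
        let mut acc = moment;
        let mut out = Vec::with_capacity(latents.len() + 1);
        out.push(acc);
        for &d in &latents {
            acc = acc.wrapping_add(d);
            out.push(acc);
        }
        latents = out;
    }
    latents
}

fn main() {
    // the example above: order 2, moments [1, 2], deltas [0, 10, 0]
    assert_eq!(decode_consecutive(&[1, 2], &[0, 10, 0]), vec![1, 3, 5, 17, 29]);
    println!("ok");
}
```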
+#### Lookback
+
+Let `lookback` be the delta latent variable.
+Mode latents are decoded via `l[i] += l[i - lookback[i]]`.
+
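A simplified sketch with toy values; in the real format the first positions are seeded from the delta state, which this sketch stands in for by treating a lookback of 0 as "leave as-is":

```rust
// lookback delta decoding: each latent adds the previously decoded latent
// at distance lookback[i], i.e. l[i] += l[i - lookback[i]]
fn decode_lookback(mut l: Vec<u64>, lookback: &[usize]) -> Vec<u64> {
    for i in 0..l.len() {
        let lb = lookback[i];
        if lb > 0 && lb <= i {
            l[i] = l[i].wrapping_add(l[i - lb]);
        }
    }
    l
}

fn main() {
    // toy values: with lookback 2, each element builds on the one 2 positions back
    let decoded = decode_lookback(vec![5, 7, 1, 2], &[0, 0, 2, 2]);
    assert_eq!(decoded, vec![5, 7, 6, 9]);
    println!("{decoded:?}");
}
```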
+### Modes
+
+Based on the mode, latents are joined into the finalized numbers.
Let `l0` and `l1` be the primary and secondary latents respectively.
Let `MID` be the middle value for the latent type (e.g. 2^31 for `u32`).
-| mode | decoding formula |
-|-------------|------------------------------------------------------------------------|
-| classic | `from_latent_ordered(l0)` |
-| int mult | `from_latent_ordered(l0 * mult + l1)` |
-| float mult | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs` |
-| float quant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1)` |
+| mode | decoding formula |
+|------------|------------------------------------------------------------------------|
+| Classic | `from_latent_ordered(l0)` |
+| IntMult | `from_latent_ordered(l0 * mult + l1)` |
+| FloatMult | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs` |
+| FloatQuant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1))` |
Here ULP refers to [unit in the last place](https://en.wikipedia.org/wiki/Unit_in_the_last_place).
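
As an illustration, the IntMult formula with a toy multiplier of 777 (matching the earlier README example; the final `from_latent_ordered` step is omitted here):

```rust
// IntMult decoding sketch: recombine latents as l0 * mult + l1
fn int_mult_decode(l0: u32, l1: u32, mult: u32) -> u32 {
    l0.wrapping_mul(mult).wrapping_add(l1)
}

fn main() {
    // 3 * 777 + 5 = 2336
    assert_eq!(int_mult_decode(3, 5, 777), 2336);
    println!("ok");
}
```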
Each data type has an order-preserving bijection to an unsigned data type.
For instance, floats have their first bit toggled, and the rest of their bits
-bits toggled if the float was originally negative:
+toggled if the float was originally negative:
```rust
fn from_unsigned(unsigned: u32) -> f32 {
@@ -163,28 +219,3 @@ fn from_unsigned(unsigned: u32) -> i32 {
i32::MIN.wrapping_add(unsigned as i32)
}
```
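
A sketch of this float bijection and its inverse, written to match the description above; the repo's actual bit-twiddling may differ in details:

```rust
// number -> unsigned: toggle the sign bit; additionally toggle the remaining
// bits when the float was negative, so the mapping is order-preserving
fn f32_to_unsigned(x: f32) -> u32 {
    let bits = x.to_bits();
    if bits >> 31 == 1 {
        !bits // negative: flip every bit
    } else {
        bits ^ (1 << 31) // non-negative: flip only the sign bit
    }
}

// unsigned -> number: the inverse of the above
fn f32_from_unsigned(unsigned: u32) -> f32 {
    if unsigned >> 31 == 1 {
        f32::from_bits(unsigned ^ (1 << 31))
    } else {
        f32::from_bits(!unsigned)
    }
}

fn main() {
    // order-preserving: -1.5 < 0.0 < 2.25 maps to increasing unsigneds
    let us: Vec<u32> = [-1.5f32, 0.0, 2.25].iter().map(|&x| f32_to_unsigned(x)).collect();
    assert!(us[0] < us[1] && us[1] < us[2]);
    // and it round-trips
    assert_eq!(f32_from_unsigned(f32_to_unsigned(-1.5)), -1.5);
    println!("ok");
}
```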
-
-### Latents <-> Deltas
-
-Latents are converted to deltas by taking consecutive differences
-`delta_order` times, and decoded by taking a cumulative sum repeatedly.
-Delta moments are emitted during encoding and consumed during decoding to
-initialize the cumulative sum.
-
-For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
-and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.
-
-### Deltas <-> Bin Indices and Offsets
-
-To dissect the deltas, we find the bin that contains each delta `x` and compute
-its offset as `x - bin.lower`.
-For instance, suppose we have these bins, where we have compute the upper bound
-for convenience:
-
-| bin idx | lower | offset bits | upper (inclusive) |
-|---------|-------|-------------|-------------------|
-| 0 | 7 | 2 | 10 |
-| 1 | 10 | 3 | 17 |
-
-Then 8 would be in bin 0 with offset 1, and 15 would be in bin 1 with offset 5.
-10 could be encoded either as bin 0 with offset 3 or bin 1 with offset 0.
diff --git a/dtype_dispatch/README.md b/dtype_dispatch/README.md
index 967d52e4..a9b15ab7 100644
--- a/dtype_dispatch/README.md
+++ b/dtype_dispatch/README.md
@@ -5,8 +5,54 @@ This is a common problem in numerical libraries (think numpy, torch, polars):
you have a variety of data types and data structures to hold them, but every
function involves matching an enum or converting from a generic to an enum.
-Consider this simple API of a hypothetical numerical library supporting the
-`length` and `add` functions on arrays, plus `new` and `downcast`:
+Example with `i32` and `f32` data types for dynamically-typed vectors,
+supporting `.length()` and `.add(other)` operations, plus generic
+`new` and `downcast` functions:
+
+```rust
+pub trait Dtype: 'static {}
+impl Dtype for i32 {}
+impl Dtype for f32 {}
+
+// register our two macros, `define_an_enum` and `match_an_enum`, constrained
+// to the `Dtype` trait, with our variant => type mapping:
+dtype_dispatch::build_dtype_macros!(
+ define_an_enum,
+ match_an_enum,
+ Dtype,
+ {
+ I32 => i32,
+ F32 => f32,
+ },
+);
+
+// define any enum holding a Vec of any data type!
+define_an_enum!(
+ #[derive(Clone, Debug)]
+ DynArray(Vec)
+);
+
+impl DynArray {
+ pub fn length(&self) -> usize {
+ match_an_enum!(self, DynArray(inner) => { inner.len() })
+ }
+
+ pub fn add(&self, other: &DynArray) -> DynArray {
+ match_an_enum!(self, DynArray(inner) => {
+      let other_inner = other.downcast_ref::<T>().unwrap();
+      let added = inner.iter().zip(other_inner).map(|(a, b)| a + b).collect::<Vec<_>>();
+ DynArray::new(added).unwrap()
+ })
+ }
+}
+
+// we could also use `DynArray::I32(vec![1_i32, 2, 3])` here, but just to show we can convert generics:
+let x_dynamic = DynArray::new(vec![1_i32, 2, 3]).unwrap();
+let x_doubled_generic = x_dynamic.add(&x_dynamic).downcast::<i32>().unwrap();
+assert_eq!(x_doubled_generic, vec![2, 4, 6]);
+```
+
+Compare this with the same API written manually:
```rust
use std::{any, mem};
@@ -88,48 +134,6 @@ powerful macros for you to use.
These building blocks can solve almost any dynamic<->generic data type dispatch
problem:
-```rust
-pub trait Dtype: 'static {}
-impl Dtype for i32 {}
-impl Dtype for f32 {}
-
-// register our two macros, `define_an_enum` and `match_an_enum`, constrained
-// to the `Dtype` trait, with our variant => type mapping:
-dtype_dispatch::build_dtype_macros!(
- define_an_enum,
- match_an_enum,
- Dtype,
- {
- I32 => i32,
- F32 => f32,
- },
-);
-
-// define any enum for any `Vec` of a data type!
-define_an_enum!(
- #[derive(Clone, Debug)]
- DynArray(Vec)
-);
-
-impl DynArray {
- pub fn length(&self) -> usize {
- match_an_enum!(self, DynArray(inner) => { inner.len() })
- }
-
- pub fn add(&self, other: &DynArray) -> DynArray {
- match_an_enum!(self, DynArray(inner) => {
-      let other_inner = other.downcast_ref::<T>().unwrap();
-      let added = inner.iter().zip(other_inner).map(|(a, b)| a + b).collect::<Vec<_>>();
- DynArray::new(added).unwrap()
- })
- }
-}
-
-// we could also use `DynArray::I32(vec![1_i32, 2, 3])` here, but just to show we can convert generics:
-let x_dynamic = DynArray::new(vec![1_i32, 2, 3]).unwrap();
-let x_doubled_generic = x_dynamic.add(&x_dynamic).downcast::<i32>().unwrap();
-assert_eq!(x_doubled_generic, vec![2, 4, 6]);
-```
## Comparisons
@@ -150,6 +154,9 @@ which is annoyingly restrictive.
For instance, traits with generic associated functions can't be put in a
`Box<dyn Trait>`.
+All enums are `#[non_exhaustive]` by default, but the generated matching macros
+handle wildcard cases and can be used safely in downstream crates.
+
## Limitations
At present, enum and container type names must always be a single identifier.
@@ -157,5 +164,6 @@ For instance, `Vec` will work, but `std::vec::Vec` and `Vec` will not.
You can satisfy this by `use`ing your type or making a type alias of it,
e.g. `type MyContainer<T> = Vec<Box<T>>`.
-It is also mandatory that you place exactly one attribute on each enum, e.g.
-with a `#[derive(Clone, Debug)]`.
+It is also mandatory that you place exactly one attribute when defining each
+enum, e.g. with a `#[derive(Clone, Debug)]`.
+If you don't want any attributes, you can just do `#[derive()]`.
diff --git a/images/wrapped_chunk_meta_plate.svg b/images/wrapped_chunk_meta_plate.svg
new file mode 100644
index 00000000..e13f9726
--- /dev/null
+++ b/images/wrapped_chunk_meta_plate.svg
@@ -0,0 +1,63 @@
+
diff --git a/images/wrapped_format.svg b/images/wrapped_format.svg
deleted file mode 100644
index bcf0a7c1..00000000
--- a/images/wrapped_format.svg
+++ /dev/null
@@ -1,208 +0,0 @@
-
diff --git a/images/wrapped_page_plate.svg b/images/wrapped_page_plate.svg
new file mode 100644
index 00000000..e2213fd3
--- /dev/null
+++ b/images/wrapped_page_plate.svg
@@ -0,0 +1,57 @@
+