document how CLVM compression works and its format

Chia-Network · Jan 6, 2025 · 2573ed1
1 parent 70d7623
commit 2573ed1
Showing 1 changed file with 154 additions and 0 deletions.
diff --git a/docs/compressed-serialization.md b/docs/compressed-serialization.md
@@ -0,0 +1,154 @@
+# Compressed Serialization
+
+With the Chia hard fork at height 5'496'000, CLVM can be serialized in a more
+space-efficient form, referring back to previous sub-trees instead of
+duplicating them. These references are called "back references".
+
+## Format
+
+The original serialization format had 3 tokens.
+
+- `0xff` - a pair, followed by the left and right sub-trees.
+- `0x80` - NIL, an empty atom
+- Atom - serialized with a UTF-8-like encoding (see [CLVM serialization](https://chialisp.com/clvm/#serialization)).
+
+A back reference is introduced by `0xfe` followed by an atom. The atom refers
+back to an already decoded sub-tree. Its bits are interpreted as a path, just
+like an environment lookup in CLVM: the atom is read as a big-endian integer and
+its bits are inspected one at a time, from least significant to most
+significant.
+
+## Paths
+
+```
+            +----------+----------+----------+----------+
+byte index: |  byte 0  |  byte 1  |  byte 2  |  byte 3  |
+            +----------+----------+----------+----------+
+ bit index: | 76543210 | 76543210 | 76543210 | 76543210 |
+            +----------+----------+----------+----------+
+
+bit traversal direction:                          <- x
+```
+
+A 0-bit means follow the left sub-tree, a 1-bit means follow the right sub-tree.
+The last (most significant) 1-bit is the terminator, and means we should pick
+the node at the current location in the tree.
+
+For example, the reference `0b1011` means:
+
+- right
+- right
+- left
+- (terminator bit)
+
+It follows the path below:
+
+```
+                 [*]
+                /   \
+               /     \
+              /       \ 1
+             /         \
+            /           \
+           /             \
+         [ ]             [*]
+        /   \           /   \ 1
+       /     \         /     \
+     [ ]     [ ]     [ ]     [*]
+     / \     / \     /  \  0 / \
+   [ ] [ ] [ ] [ ] [ ] [ ] [*] [ ]
+```
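
The bit-by-bit reading described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the CLVM implementation; the function name `path_to_steps` is made up for this example, and the path is assumed to be given as a positive integer.

```python
def path_to_steps(path: int) -> list:
    """Decode a back-reference path into a list of 'left'/'right' steps."""
    if path <= 0:
        raise ValueError("a path needs at least the terminator bit")
    steps = []
    # Read bits from the least significant end. Stop once only the
    # terminator (the most significant 1-bit) remains.
    while path > 1:
        steps.append("right" if path & 1 else "left")
        path >>= 1
    return steps


print(path_to_steps(0b1011))  # ['right', 'right', 'left']
```

A path of `1` consists of the terminator only, so it decodes to an empty list of steps and selects the root of the tree it is applied to.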
+
+## Parsing
+
+Back references refer into the "parse stack". This is a CLVM tree that's updated
+as we parse, so what a back reference refers to changes as we parse the
+serialized CLVM tree. To understand what the parse stack is, we first need to
+look at how CLVM is parsed.
+
+The parser has a stack of _operations_ and a stack of the parsed result (the
+parse stack).
+
+There are 2 operations that can be pushed onto the operations stack:
+
+- `Cons` - Construct a pair (cons box)
+- `Traverse` - parse a sub-tree
+
+As outlined in the [Format](#format) section, there are two kinds of tokens we
+can encounter when parsing: an atom or a pair (followed by the left and right
+sub-trees).
+
+We keep popping operations off the op-stack until it's empty. We take the
+following actions depending on the operation:
+
+- `Traverse`, inspect the next byte of the input stream. If it's a pair (`0xff`)
+  we push `Cons`, `Traverse`, `Traverse` onto the operations stack. If it's an
+  atom, parse the atom and push it onto the parse stack.
+
+- `Cons`, pop two nodes from the parse stack and create a new pair with the
+  second node popped as the left side and the first node popped as the right
+  side. Push the resulting pair onto the parse stack.
+
+### Example
+
+To parse the tokens: `0xff` `1` `0xff` `2` `foobar`, the two stacks end up like
+this while parsing. The stacks grow to the right in this illustration.
+
+| step              | op-stack                       | parse-stack              |
+| ----------------- | ------------------------------ | ------------------------ |
+| 1, initial state  | Traverse                       |                          |
+| 2, parse `0xff`   | Cons, Traverse, Traverse       |                          |
+| 3, parse `1`      | Cons, Traverse                 | `1`                      |
+| 4, parse `0xff`   | Cons, Cons, Traverse, Traverse | `1`                      |
+| 5, parse `2`      | Cons, Cons, Traverse           | `1`, `2`                 |
+| 6, parse `foobar` | Cons, Cons                     | `1`, `2`, `foobar`       |
+| 7, pop2 and cons  | Cons                           | `1`, (`2` . `foobar`)    |
+| 8, pop2 and cons  |                                | (`1` . (`2` . `foobar`)) |
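
The trace in the table can be reproduced with a small Python sketch of the two-stack parser. It works on an already tokenized input rather than on raw bytes, so atom decoding is left out. The `PAIR` sentinel, the string operation names and the use of tuples for pairs are assumptions made for this illustration only.

```python
PAIR = object()  # hypothetical stand-in for the 0xff token


def parse(tokens):
    """Parse a token sequence into nested (left, right) tuples."""
    tokens = iter(tokens)
    ops = ["Traverse"]  # operations stack, top is the end of the list
    stack = []          # parse stack, top is the end of the list
    while ops:
        op = ops.pop()
        if op == "Traverse":
            token = next(tokens)
            if token is PAIR:
                # Cons ends up below the two Traverse ops, so it runs
                # only after both sub-trees have been parsed.
                ops += ["Cons", "Traverse", "Traverse"]
            else:
                stack.append(token)
        else:  # "Cons"
            right = stack.pop()  # the first pop is the right side
            left = stack.pop()
            stack.append((left, right))
    return stack.pop()


print(parse([PAIR, 1, PAIR, 2, "foobar"]))  # (1, (2, 'foobar'))
```

Using an explicit operations stack instead of recursion keeps the parser's depth independent of the call stack, which matters for deeply nested input.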
+
+## Parse stack
+
+When a back-reference token (`0xfe`) is encountered, the parse stack in its
+current state is used as the environment for the back-reference path to look up
+what node to place at this position in the resulting tree.
+
+The parse stack is itself a LISP list of items. The top of the stack is the head
+of the list.
+
+e.g.
+
+The stack `1`, `2`, `3` (with `1` on top) would have the following LISP structure:
+
+```
+(`1` . (`2` . (`3` . NIL)))
+```
+
+A back reference to `3` would be: `0b1011` (right, right, left).
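
As a rough illustration, the list structure above can be modelled with nested Python tuples and looked up with a path. Following the path rules from the [Paths](#paths) section, right, right, left (`0b1011`) selects `3`. The names `NIL`, `stack_to_list` and `lookup` are hypothetical and exist only in this sketch.

```python
NIL = None  # hypothetical stand-in for the empty atom


def stack_to_list(items):
    """Build a LISP-style list from stack items given top-first."""
    result = NIL
    for item in reversed(items):
        result = (item, result)
    return result


def lookup(env, path):
    """Follow a path through nested (left, right) tuples."""
    if path <= 0:
        raise ValueError("a path needs at least the terminator bit")
    while path > 1:
        env = env[1] if path & 1 else env[0]
        path >>= 1
    return env


stack = stack_to_list([1, 2, 3])
print(stack)                  # (1, (2, (3, None)))
print(lookup(stack, 0b1011))  # 3
```

With the same structure, the path `0b10` (left) selects the top item `1`, and `0b101` (right, left) selects `2`.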
+
+### Reference the stack itself
+
+Back references aren't limited to referencing items on the stack; they can
+reference any node in the tree that makes up the stack, including the stack
+list itself. For example, consider the following structure:
+
+`0xff` `foobar` `0xff` `foobar` NIL
+
+After having parsed the first `foobar`, the parse stack will be (`foobar` . NIL)
+(a list with one item). The whole next part of the CLVM tree can be replaced
+with the parse stack itself, i.e. we can use a back reference of `1`. We then
+get the NIL and the cons box "for free"; they are implied by the parse stack.
+
+In practice, however, this rarely happens.
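
The `foobar` example can be checked with a sketch of a parser that keeps its parse stack as a LISP-style list and resolves back references against it. It operates on tokens rather than bytes. `PAIR`, `NIL` and the `BackRef` class are hypothetical stand-ins for `0xff`, the empty atom and a `0xfe` token with its path; none of them are names from the real implementation.

```python
PAIR = object()  # hypothetical stand-in for the 0xff token
NIL = None       # hypothetical stand-in for the empty atom


class BackRef:
    """Hypothetical stand-in for a 0xfe token followed by its path atom."""

    def __init__(self, path):
        self.path = path


def lookup(env, path):
    """Follow a path through nested (left, right) tuples."""
    while path > 1:
        env = env[1] if path & 1 else env[0]
        path >>= 1
    return env


def parse(tokens):
    tokens = iter(tokens)
    ops = ["Traverse"]
    stack = NIL  # parse stack as a list of pairs; the head is the top
    while ops:
        op = ops.pop()
        if op == "Traverse":
            token = next(tokens)
            if token is PAIR:
                ops += ["Cons", "Traverse", "Traverse"]
            elif isinstance(token, BackRef):
                # The current parse stack is the environment for the path.
                stack = (lookup(stack, token.path), stack)
            else:
                stack = (token, stack)
        else:  # "Cons"
            right, stack = stack  # the first pop is the right side
            left, stack = stack
            stack = ((left, right), stack)
    return stack[0]


print(parse([PAIR, "foobar", BackRef(1)]))  # ('foobar', ('foobar', None))
```

Parsing the uncompressed tokens `PAIR, "foobar", PAIR, "foobar", NIL` with the same function yields the same tree.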
+
+## Generating back references
+
+When serializing with compression, we need to assign a tree-hash and an
+(uncompressed) serialized length to every node. When deciding whether to output
+the sub-tree itself or a back reference, we need to know whether we have already
+serialized an identical sub-tree. If we have, we then have to perform a search
+from that node up through all of its parents until we reach the top of the parse
+stack. This additionally requires a data structure that knows about the parents
+of all nodes.
+
+This search is performed in `find_path()`. There may be multiple paths leading
+to the stack (if the same structure is repeated in multiple places). We pick the
+_shortest_ path. This path may still be quite long, if the stack is deep or if
+the node is found deep down in a CLVM structure. We need to compare the length
+of the path against the serialized length of the sub-tree. If the path is
+longer, it would be a net loss to replace it with a back reference.
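
The size comparison can be illustrated with a small sketch. It is not the logic of `find_path()`; the function names are hypothetical. It assumes that the path is stored as a minimal big-endian atom, that atoms shorter than 0x40 bytes take a one-byte length prefix, and that a single byte below `0x80` is stored as-is. Longer paths are out of scope here.

```python
def backref_length(path: int) -> int:
    """Bytes needed for the 0xfe marker plus the path atom (short paths only)."""
    if path <= 0:
        raise ValueError("a path needs at least the terminator bit")
    size = (path.bit_length() + 7) // 8  # bytes in the big-endian path
    if size == 1 and path < 0x80:
        atom_len = 1         # small single-byte atoms are stored as-is
    elif size < 0x40:
        atom_len = 1 + size  # one length-prefix byte, then the bytes
    else:
        raise ValueError("longer length prefixes are not handled in this sketch")
    return 1 + atom_len      # 1 for the 0xfe marker


def should_use_backref(path: int, subtree_serialized_length: int) -> bool:
    """Only use a back reference when it is strictly smaller."""
    return backref_length(path) < subtree_serialized_length


print(should_use_backref(0b1011, 7))  # True
print(should_use_backref(0b1011, 1))  # False
```

The first call models replacing a 7-byte sub-tree (such as the atom `foobar` with its length prefix) with a 2-byte back reference. The second models a 1-byte atom, where a back reference would be larger than the atom itself.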