
docs: draft spec for mast binary format #132

Closed
wants to merge 1 commit

Conversation

bitwalker
Contributor

This PR adds a new document describing the new in-memory representation we've been discussing recently, and provides a draft specification for the MAST binary format. This isn't meant to be the canonical source for the specification; I just found it convenient to draft it here, and figured it would be a good way to get feedback before opening a more formal proposal in the miden-vm repo.

@bobbinth You're going to be the one most interested in what's here. Feel free to skip over the section that details aspects of the in-memory representation. Nothing has really changed there since our discussion over in the miden-vm repo, but I tried to write up a bit more of a summary/intro document to make sure I understood all the pieces more or less. The primary thing I'd like to get your opinion on is the specification of the binary format itself. It's not a formal spec, but it is precisely defined.

@greenhat Also interested in any notes you have, a lot of the details here are quite similar to the package format we're also discussing, so it should be pretty familiar already. The two formats are intended to have some synergistic properties, so that's part of the reason they are so similar.

Anyway, once you both feel like this is ready to propose in the miden-vm repo, I'll open an issue for it there. Consider this an early draft.

@bitwalker bitwalker added documentation Improvements or additions to documentation todo This is a task related to an incomplete feature labels Feb 22, 2024
@bitwalker bitwalker self-assigned this Feb 22, 2024
Contributor

@bobbinth bobbinth left a comment

Thank you! This looks great as well. I've also just skimmed through the document for now, but left a few comments inline.

Inline review comments on docs/src/appendix/mast.md
Contributor

@greenhat greenhat left a comment

Looking great overall! Please check my notes.

Inline review comments on docs/src/appendix/mast.md

pub struct OpBatch<const BATCH_SIZE: usize = 8> {
    ops: Vec<Operation>,
    groups: [Felt; BATCH_SIZE],
}
Contributor

I cannot figure out the purpose of the groups (especially the use of Felt). Could you please clarify this?

Contributor

The primary reason for this is the way the VM decodes programs. Since Felt is the smallest unit the VM understands natively, we encode up to 9 operations into a single Felt, and we call this an "operation group". Then, up to 8 groups can be absorbed by a single permutation of the hash function we use (RPO), and the VM needs to execute additional operations between batches when decoding more than 8 groups (i.e., more than a single batch). At a very high level, when we execute a basic block consisting of 3 batches, the VM actually executes the following operations:

SPAN
<operations from batch 0>
RESPAN
<operations from batch 1>
RESPAN
<operations from batch 2>
END
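The grouping described here can be sketched roughly as follows. This is a hypothetical helper, not the actual miden-core API; it assumes 7-bit opcodes, so 9 of them (9 × 7 = 63 bits) fit in a single 64-bit field element:

```rust
/// Hypothetical sketch: pack up to 9 seven-bit opcodes into a single
/// u64 "operation group" (9 * 7 = 63 bits, which fits in one felt).
fn pack_group(opcodes: &[u8]) -> u64 {
    assert!(opcodes.len() <= 9, "a group holds at most 9 operations");
    let mut group = 0u64;
    for (i, &op) in opcodes.iter().enumerate() {
        assert!(op < 128, "opcodes are 7 bits");
        // Each opcode occupies its own 7-bit slot within the group.
        group |= (op as u64) << (7 * i);
    }
    group
}
```

The real encoding in miden-core also has to account for details like immediate values occupying their own group slots, which this sketch ignores.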

Contributor

Got it. Thanks for the detailed explanation!

Contributor Author

I was also confused by this when reviewing the implementation in miden-core, but after re-reading things several times I arrived at more or less what @bobbinth explained here. As an aside, your explanation here is actually better and more succinct than anything I came across; we might actually want to add it to the code in miden-core as a brief summary for future readers!

Inline review comments on docs/src/appendix/mast.md
Comment on lines +316 to +319
/// The offset in the trailing data section of `EncodedMastForest`
/// where the data for this node starts. If this node has no data,
/// then the value will always be zero. Otherwise, the data should
/// be decoded starting at this offset, the specifics of which
/// depend on the node type
offset: DataOffset,
Contributor

What if the node's data is the "first" in the buffer? It has a zero offset, but we treat it as "no data".

Contributor Author

What actually determines if the node has data or not is the MastNodeType. So even if offset is 0, if the node type has associated data, we will interpret offset as a valid offset.

The comment here is referring to the fact that if there is no associated data, the value will always be zero, regardless of the current offset in data where the next write will occur. For example, if the current offset in data is 128, and we're encoding a MastNodeInfo for a node that has no associated data, then we will not store 128 as the offset in the encoded MastNodeInfo, we will store 0.

In reality, either choice would be fine, but 0 has the smallest possible encoding, and I didn't want to waste bytes encoding an offset that is never used.
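The rule being described, store a real offset only when the node type has associated data, amounts to something like the following. `encode_offset` is a hypothetical helper for illustration, not part of the spec:

```rust
/// Hypothetical sketch of the offset-encoding rule: nodes without
/// associated data always store 0, regardless of where the next
/// write into the data section would land.
fn encode_offset(has_data: bool, current_offset: u32) -> u32 {
    if has_data {
        current_offset
    } else {
        // Zero has the smallest varint encoding, so unused offsets
        // cost a single byte.
        0
    }
}
```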

}
```

The only thing to note here is that we want the `MastNodeInfo` type to be fixed-width, to enable indexing. We also want to keep the encoding compact, but efficient to decode. As a result, we make a slight tradeoff in the encoding of this type to make traversal fast, by inlining references to other nodes. This means padding bytes are introduced for node types which do not have such references, and which cannot take advantage of the padding to store a subset of their data. This is estimated to be a worthwhile tradeoff, but that estimate isn't based on any benchmarks, just gut feeling.
Contributor

Assuming it goes on-chain, I'd rather optimize it heavily for size. We can always offload slow ("analytical") processing to an indexer, but the on-chain format is almost set in stone and would be a constantly growing "headache". Keep in mind that code size in smart contracts tends to be rather small, so we are talking kilobytes, not megabytes.

Contributor Author

My expectation is that we are going to be compressing the serialized binary output of the encoder, and decompressing prior to feeding it into the decoder. Thus, making our binary format highly structured will make it easier to compress to a very tiny size.

However, I do have two other possible encoding schemes in mind, which are variations of this one that use different techniques for serialization/deserialization with different tradeoffs. I want to evaluate this one first, to see the sizes of the compressed binary output before evaluating the others, as they are more complex, and thus produce smaller output, but that may make them less amenable to compression, and slower to encode/decode.

One small modification we could easily make to this format, though, is to delta-encode things like indices/offsets, since they are monotonically increasing. This could keep virtually all such indices/offsets as small as 1 byte for any given program, and it doesn't add much serialization/deserialization overhead.

I'm not particularly worried about padding bytes though, they essentially disappear under even the most trivial compression schemes. I'm much more interested in how to exploit the structure of MAST programs to make the encoding both small and compressible to the maximum degree possible.
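The delta-encoding idea mentioned above can be sketched in a few lines. These are hypothetical helpers for illustration, assuming a monotonically non-decreasing input sequence:

```rust
/// Store each offset as the difference from the previous one; for a
/// monotonically increasing sequence, most deltas fit in a 1-byte varint.
fn delta_encode(offsets: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    offsets
        .iter()
        .map(|&o| {
            let d = o - prev; // assumes offsets are non-decreasing
            prev = o;
            d
        })
        .collect()
}

/// Recover the original offsets by a running sum over the deltas.
fn delta_decode(deltas: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}
```

Combined with a varint encoding for the deltas, this is where the "1 byte per offset" estimate comes from.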

@Fumuran

Fumuran commented Mar 8, 2024

Looks great! I don't think I can add anything to what was written above, but it looks awesome to me.

@bitwalker
Contributor Author

I've updated the draft document with the feedback everyone has provided so far. A few comments were questions/clarifications, so I've left those unresolved for subsequent review. Let me know if you have further feedback!

I should also note that 0xPolygonMiden/miden-vm#1277 makes changes to Miden Assembly that will make transitioning to MAST as the binary format quite trivial (once the actual encoder/decoder is written), since all invocation targets (i.e., the callees of instructions like `call`) are resolved to MAST digests, including `exec` (by using PROXY nodes, what is called External in this draft proposal). As a result, we won't have to make any changes to the assembler to implement this, aside from emitting MAST as the binary form rather than MASM.

Contributor

@bobbinth bobbinth left a comment

Looks great! Thank you again for such a comprehensive write up! I left some comments inline - but most are minor (and some may be premature optimizations).

One thing I wanted to mention more specifically (and this is also referenced in one of the comments): I think we actually need 2 variants of the format, "fast" and "compact". The current spec defines the "fast" format, which I think is the right one to start with (the "compact" variant we can define much later).

The reason why I think we need to differentiate between the two explicitly is because they have different trust assumptions and structures. Specifically:

  1. The "fast" variant is trusted. That is, the deserializer would trust that the serializer encoded everything correctly. The main benefit of this is that we don't need to do any hashing during deserialization (e.g., we trust that digests included with each node are correct).
  2. The "compact" variant is un-trusted. That is, the deserializer would need to compute hashes of all nodes to verify that they are correct. This introduces significant computational overhead, but it does allow removing the digest field from MastNodeInfo (excluding the External variant) - which, I expect, would reduce the size of the binary quite a bit (primarily because digests are basically incompressible data).

The "compact" variant would be the one distributed between parties which don't trust each other. But once I download the "compact" variant from someone, I can convert it to the "fast" variant and store it on disk - and so the VM would work primarily (or maybe even exclusively) with the "fast" variant.

As mentioned above, we don't need to worry about the "compact" variant now, with the only exception being potential addition of a binary flag to indicate fast/compact encoding for a given file.


#### Variable-Width Integers

First, let's introduce a type that we will use throughout the rest of this spec, `VInt64`. It represents a variable-width, unsigned, little-endian 64-bit integer. It is encoded as one or more bytes, where each byte has 7 bits to represent the value, with the last bit used to signal when another byte is needed to represent the encoded value. In this case, unless otherwise noted, we treat it as an error if a `VInt64`-encoded integer would not fit in the integral type given.
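The continuation-bit scheme described in this paragraph is essentially LEB128. A minimal sketch of that scheme follows; note that the actual Miden encoding may differ (a review comment below suggests it encodes the byte count in the trailing zeros of the first byte instead), so treat this purely as an illustration of the scheme as written:

```rust
/// LEB128-style varint: 7 value bits per byte, little-endian, with the
/// high bit of each byte set when another byte follows.
fn vint64_encode(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // last byte: continuation bit clear
            break;
        }
        out.push(byte | 0x80); // more bytes follow
    }
    out
}

/// Decode a varint, returning None on truncated or over-long input.
fn vint64_decode(bytes: &[u8]) -> Option<u64> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        if i >= 10 {
            return None; // a u64 never needs more than 10 bytes
        }
        value |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            return Some(value);
        }
    }
    None // ran out of bytes with the continuation bit still set
}
```

Values below 128 encode as a single byte, which is what makes delta-encoded offsets so cheap.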
Contributor

I think the explanation of the VInt64 is not exactly correct here (i.e., I don't believe it uses a "continuation" bit in each byte but rather encodes the total number of bytes using trailing zeros of the first byte).

Comment on lines +237 to +243
/// The format version, initially all zeros
///
/// If future modifications are made to this format, the version
/// should be incremented by 1. A version of 0b111 is reserved
/// for future extensions that require extending the version
/// field itself, but should be considered invalid for now.
version: [u8; 3],
Contributor

We will probably need 2 variations of the format - one for "fast" encoding/decoding and another one is for "compact" encoding (see overall review comment). So, I'm wondering if, in addition to the version field, we also need some binary flag which indicates whether this file is "fast" or "compact".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other question re version: would this implicitly encode the VM version as well? Specifically, let's say we modify the instruction set in some way - would the format version be incremented? If not, would it make sense to add something like miden_version field here (instead of it being in the package definition)?

Comment on lines +255 to +257
/// There is one byte for every 8 nodes in the set, with the node
/// count rounded up to the next multiple of 8.
roots: [u8; (self.node_count + 7) / 8],
Contributor

@bobbinth bobbinth Apr 14, 2024

I wonder if using 1 bit per node to indicate that a node is a root is a premature optimization at this point. We are basically assuming that the proportion of roots to nodes is going to be relatively high. But, if for example, we have 1000 nodes and only a dozen roots, recording them as a list of node indexes is going to be about 5x more compact (125 bytes in case of 1 bit per node vs. ~25 bytes in case of recording indexes for a dozen of nodes).

Overall, I'm not opposed to this approach, but I do wonder if we should see some evidence first that this is indeed an improvement over the simpler approach.

Comment on lines +327 to +335
#[repr(u8)]
pub enum MastNodeType {
/// For more efficient decoding, and because of the
/// frequency with which these node types appear, we
/// directly represent the child indices for `Join`,
/// `Split`, and `Loop` inline. This incurs 8 bytes
/// of padding for other node types, except `Block`
/// which takes advantage of this to also inline some
/// of its data
Contributor

@bobbinth bobbinth Apr 14, 2024

Probably a premature optimization on my part, but I wonder if we should always encode MastNodeType as 8 bytes (rather than the current 9). Basically, the first 4 bits could be used to encode the variant, and then we could still encode 2 node indexes with 30 bits each. This does cap the maximum number of nodes at $2^{30}$, but I don't think that's a particularly big concern.

On the other hand, having MastNodeType aligned on 8-byte boundaries may be beneficial.
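The packing bobbinth suggests, a 4-bit variant tag plus two 30-bit node indexes in 8 bytes, can be sketched as follows (hypothetical helpers; 4 + 30 + 30 = 64 bits exactly):

```rust
/// Pack a 4-bit variant tag and two 30-bit child indexes into one u64.
fn pack_node(tag: u8, left: u32, right: u32) -> u64 {
    assert!(tag < 16, "tag must fit in 4 bits");
    assert!(left < (1 << 30) && right < (1 << 30), "indexes are capped at 2^30");
    ((tag as u64) << 60) | ((left as u64) << 30) | (right as u64)
}

/// Recover (tag, left, right) from the packed representation.
fn unpack_node(packed: u64) -> (u8, u32, u32) {
    (
        (packed >> 60) as u8,
        ((packed >> 30) & 0x3fff_ffff) as u32,
        (packed & 0x3fff_ffff) as u32,
    )
}
```

This is what caps the node count at 2^30, as noted in the comment.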

Comment on lines +342 to +346
Block {
/// The maximum number of operations in a block
/// is always smaller than `u8::MAX`
len: u8
},
Contributor

Why is u8::MAX the maximum number of operations in a block? I don't think we have such a limit. I think here we could probably use u32::MAX.

Comment on lines +406 to +416
/// A simple operation, with its opcode encoded in a single byte
Op { opcode: u8 },
/// An operation with an immediate argument.
///
/// Like `Op`, the opcode takes a single byte, and the immediate
/// is encoded using a variable-width integer. The immediate must
/// be converted to/from `Felt` when decoded.
OpWithImmediate {
opcode: u8,
imm: VInt64,
},
Contributor

I wonder if this distinction is actually needed. By reading the opcode we could tell whether this operation has an immediate or not. Currently, the only operation which has an immediate is the PUSH operation.

Comment on lines +388 to +405
/// Unlike most other type definitions here, this one
/// is not encoded using the usual Rust layout rules.
///
/// Specifically, we treat this as a variable-width type,
/// without the padding that would normally be introduced
/// to make smaller variants as large as the largest variant.
///
/// So each encoded operation starts with the tag byte, which
/// indicates which variant we're going to be parsing. We then
/// parse the data for that variant, and immediately look for
/// the next tag byte (if there are more ops to parse).
///
/// It is strictly for purposes of readability.
///
/// NOTE: Decorators are always associated with the next non-decorator
/// operation in the basic block being decoded.
#[repr(u8)]
pub enum EncodedOperation {
Contributor

Maybe the comment explains exactly this, but I think we don't need a special enum to differentiate different types of operations and decorators. Specifically, we just need 1 byte to encode both the tag and the opcode.

This is possible because operations require only 7 bits to encode. So, we can say that if the first bit of a byte is 0, the remaining 7 bits encode an opcode; if the first bit is 1, the remaining 7 bits encode tag of a decorator.

This way, the vast majority of EncodedOperations would require 1 byte to encode (rather than 2 bytes).
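The one-byte scheme suggested here can be sketched directly (hypothetical types, not the spec's):

```rust
/// Result of decoding a single tag/opcode byte under the suggested scheme.
enum Encoded {
    Op(u8),        // high bit 0: low 7 bits are an opcode
    Decorator(u8), // high bit 1: low 7 bits are a decorator tag
}

/// Decode one byte: the high bit selects operation vs. decorator,
/// and the remaining 7 bits carry the opcode or decorator tag.
fn decode_byte(b: u8) -> Encoded {
    if b & 0x80 == 0 {
        Encoded::Op(b & 0x7f)
    } else {
        Encoded::Decorator(b & 0x7f)
    }
}
```

This works because operations need only 7 bits, leaving the high bit free as a discriminant.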

@bobbinth
Contributor

bobbinth commented Apr 15, 2024

One other thing I forgot to clarify: would things like source locations and other debug-related data go into the data section? Or these things are not really a part of this spec and would be handled separately (i.e., via the package format)?

@bobbinth
Contributor

One other thing I forgot to clarify: would things like source locations and other debug-related data go into the data section? Or these things are not really a part of this spec and would be handled separately (i.e., via the package format)?

Actually, re-reading #129, I've answered my own question: debug data goes into the package but not into MAST.

@bitwalker
Contributor Author

This has been implemented in 0xPolygonMiden/miden-vm, more or less, so I'm closing this PR since it was primarily meant for drafting the initial spec.

@bitwalker bitwalker closed this Aug 27, 2024