
docs: draft spec for mast binary format #132

Closed
wants to merge 1 commit

Conversation

bitwalker
Contributor

This PR adds a new document describing the new in-memory representation we've been discussing recently, and provides a draft specification for the MAST binary format. This isn't meant to be the canonical source for the specification; I just found it convenient to draft it here, and figured it would be a good way to get feedback before opening a more formal proposal in the miden-vm repo.

@bobbinth You're going to be the one most interested in what's here. Feel free to skip over the section that details aspects of the in-memory representation. Nothing has really changed there since our discussion over in the miden-vm repo, but I tried to write up a bit more of a summary/intro document to make sure I understood all the pieces more or less. The primary thing I'd like to get your opinion on is the specification of the binary format itself. It's not a formal spec, but it is precisely defined.

@greenhat Also interested in any notes you have, a lot of the details here are quite similar to the package format we're also discussing, so it should be pretty familiar already. The two formats are intended to have some synergistic properties, so that's part of the reason they are so similar.

Anyway, once you both feel like this is ready to propose in the miden-vm repo, I'll open an issue for it there. Consider this an early draft.

@bitwalker bitwalker added documentation Improvements or additions to documentation todo This is a task related to an incomplete feature labels Feb 22, 2024
@bitwalker bitwalker self-assigned this Feb 22, 2024
Contributor

@bobbinth bobbinth left a comment

Thank you! This looks great as well. I've also just skimmed through the document for now, but left a few comments inline.

Inline review comments on docs/src/appendix/mast.md
Contributor

@greenhat greenhat left a comment

Looking great overall! Please check my notes.

Inline review comments on docs/src/appendix/mast.md

pub struct OpBatch<const BATCH_SIZE: usize = 8> {
    ops: Vec<Operation>,
    groups: [Felt; BATCH_SIZE],
}
Contributor

I cannot figure out the purpose of the groups (especially the use of Felt). Could you please clarify this?

Contributor

The primary reason for this is the way the VM decodes programs. Since Felt is the smallest unit the VM understands natively, we encode up to 9 operations into a single Felt, and we call this an "operation group". Then, up to 8 groups can be absorbed by a single permutation of the hash function we use (RPO), and the VM needs to execute additional operations between batches when decoding more than 8 groups (i.e., more than a single batch). At a very high level, when we execute a basic block consisting of 3 batches, the VM actually executes the following operations:

SPAN
<operations from batch 0>
RESPAN
<operations from batch 1>
RESPAN
<operations from batch 2>
END
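The grouping described here can be sketched roughly as follows. This is a hypothetical helper, not the actual miden-core API; it assumes 7-bit opcodes, so 9 of them (9 × 7 = 63 bits) fit in a single 64-bit field element:

```rust
/// Hypothetical sketch: pack up to 9 seven-bit opcodes into a single
/// u64 "operation group" (9 * 7 = 63 bits, which fits in one felt).
fn pack_group(opcodes: &[u8]) -> u64 {
    assert!(opcodes.len() <= 9, "a group holds at most 9 operations");
    let mut group = 0u64;
    for (i, &op) in opcodes.iter().enumerate() {
        assert!(op < 128, "opcodes are 7 bits");
        // Each opcode occupies its own 7-bit slot within the group.
        group |= (op as u64) << (7 * i);
    }
    group
}
```

The real encoding in miden-core also has to account for details like immediate values occupying their own group slots, which this sketch ignores.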

Contributor

Got it. Thanks for the detailed explanation!

Contributor Author

I was also confused by this when reviewing the implementation in miden-core, but after re-reading things several times I arrived at more or less what @bobbinth explained here. As an aside, your explanation here is actually better and more succinct than anything I came across; we might actually want to add it to the code in miden-core as a brief summary for future readers!

Inline review comments on docs/src/appendix/mast.md
Comment on lines +316 to +319
/// The offset in the trailing data section of `EncodedMastForest`
/// where the data for this node starts. If this node has no data,
/// then the value will always be zero. Otherwise, the data should
/// be decoded starting at this offset, the specifics of which
/// depend on the node type
offset: DataOffset,
Contributor

What if the node's data is the "first" in the buffer? It has a zero offset, but we treat it as "no data".

Contributor Author

What actually determines if the node has data or not is the MastNodeType. So even if offset is 0, if the node type has associated data, we will interpret offset as a valid offset.

The comment here is referring to the fact that if there is no associated data, the value will always be zero, regardless of the current offset in data where the next write will occur. For example, if the current offset in data is 128, and we're encoding a MastNodeInfo for a node that has no associated data, then we will not store 128 as the offset in the encoded MastNodeInfo, we will store 0.

In reality, either choice would be fine, but 0 has the smallest possible encoding, and I didn't want to waste bytes encoding an offset that is never used.
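The rule being described, store a real offset only when the node type has associated data, amounts to something like the following. `encode_offset` is a hypothetical helper for illustration, not part of the spec:

```rust
/// Hypothetical sketch of the offset-encoding rule: nodes without
/// associated data always store 0, regardless of where the next
/// write into the data section would land.
fn encode_offset(has_data: bool, current_offset: u32) -> u32 {
    if has_data {
        current_offset
    } else {
        // Zero has the smallest varint encoding, so unused offsets
        // cost a single byte.
        0
    }
}
```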

}
```

The only thing to note here is that we want the `MastNodeInfo` type to be fixed-width, to enable indexing. We also want to keep the encoding compact, but efficient to decode. As a result, we make a slight tradeoff in the encoding of this type to make traversal fast, by inlining references to other nodes. This means padding bytes are introduced for node types which do not have such references, and which cannot take advantage of the padding to store a subset of their data. This is estimated to be a worthwhile tradeoff, but that estimate isn't based on any benchmarks, just gut feeling.
Contributor

Assuming it goes on-chain, I'd rather optimize it heavily for size. We can always offload slow ("analytical") processing to an indexer, but the on-chain format is almost set in stone and would be a constantly growing "headache". Keep in mind that code size in smart contracts tends to be rather small, so we are talking kilobytes, not megabytes.

Contributor Author

My expectation is that we are going to be compressing the serialized binary output of the encoder, and decompressing prior to feeding it into the decoder. Thus, making our binary format highly structured will make it easier to compress to a very tiny size.

However, I do have two other possible encoding schemes in mind, which are variations of this one that use different techniques for serialization/deserialization with different tradeoffs. I want to evaluate this one first, to see the sizes of the compressed binary output before evaluating the others, as they are more complex, and thus produce smaller output, but that may make them less amenable to compression, and slower to encode/decode.

One small modification we could easily make to this format, though, is to delta-encode things like indices/offsets, since they are monotonically increasing. This could keep virtually all such indices/offsets as small as 1 byte for any given program, and it doesn't add much serialization/deserialization overhead.

I'm not particularly worried about padding bytes though, they essentially disappear under even the most trivial compression schemes. I'm much more interested in how to exploit the structure of MAST programs to make the encoding both small and compressible to the maximum degree possible.
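The delta-encoding idea mentioned above can be sketched in a few lines. These are hypothetical helpers for illustration, assuming a monotonically non-decreasing input sequence:

```rust
/// Store each offset as the difference from the previous one; for a
/// monotonically increasing sequence, most deltas fit in a 1-byte varint.
fn delta_encode(offsets: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    offsets
        .iter()
        .map(|&o| {
            let d = o - prev; // assumes offsets are non-decreasing
            prev = o;
            d
        })
        .collect()
}

/// Recover the original offsets by a running sum over the deltas.
fn delta_decode(deltas: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}
```

Combined with a varint encoding for the deltas, this is where the "1 byte per offset" estimate comes from.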

@Fumuran

Fumuran commented Mar 8, 2024

Looks great! I don't think I can add anything to what was written above, but it looks awesome to me.

@bitwalker
Contributor Author

I've updated the draft document with the feedback everyone has provided so far. A few comments were questions/clarifications, so I've left those unresolved for subsequent review. Let me know if you have further feedback!

I should also note that 0xPolygonMiden/miden-vm#1277 makes changes to Miden Assembly that will make transitioning to MAST as the binary format quite trivial (once the actual encoder/decoder is written), since all invocation targets (i.e., the callees of instructions like `call`) are resolved to MAST digests, including `exec` (by using PROXY nodes, what is called External in this draft proposal). As a result, we won't have to make any changes to the assembler to implement this, aside from emitting MAST as the binary form rather than MASM.

Contributor

@bobbinth bobbinth left a comment

Looks great! Thank you again for such a comprehensive write up! I left some comments inline - but most are minor (and some may be premature optimizations).

One thing I wanted to mention more specifically (and this is also referenced in one of the comments): I think we actually need 2 variants of the format, "fast" and "compact". The current spec defines the "fast" format, which I think is the right one to start with (the "compact" variant we can define much later).

The reason why I think we need to differentiate between the two explicitly is because they have different trust assumptions and structures. Specifically:

  1. The "fast" variant is trusted. That is, the deserializer would trust that the serializer encoded everything correctly. The main benefit of this is that we don't need to do any hashing during deserialization (e.g., we trust that digests included with each node are correct).
  2. The "compact" variant is un-trusted. That is, the deserializer would need to compute hashes of all nodes to verify that they are correct. This introduces significant computational overhead, but it does allow removing the digest field from MastNodeInfo (excluding the External variant) - which, I expect, would reduce the size of the binary quite a bit (primarily because digests are basically incompressible data).

The "compact" variant would be the one distributed between parties which don't trust each other. But once I download the "compact" variant from someone, I can convert it to the "fast" variant and store it on disk - and so the VM would work primarily (or maybe even exclusively) with the "fast" variant.

As mentioned above, we don't need to worry about the "compact" variant now, with the only exception being potential addition of a binary flag to indicate fast/compact encoding for a given file.


#### Variable-Width Integers

First, let's introduce a type that we will use throughout the rest of this spec, `VInt64`. It represents a variable-width, unsigned, little-endian 64-bit integer. It is encoded as one or more bytes, where each byte has 7 bits to represent the value, with the last bit used to signal when another byte is needed to represent the encoded value. In this case, unless otherwise noted, we treat it as an error if a `VInt64`-encoded integer would not fit in the integral type given.
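The continuation-bit scheme described in this paragraph is essentially LEB128. A minimal sketch of that scheme follows; note that the actual Miden encoding may differ (a review comment below suggests it encodes the byte count in the trailing zeros of the first byte instead), so treat this purely as an illustration of the scheme as written:

```rust
/// LEB128-style varint: 7 value bits per byte, little-endian, with the
/// high bit of each byte set when another byte follows.
fn vint64_encode(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // last byte: continuation bit clear
            break;
        }
        out.push(byte | 0x80); // more bytes follow
    }
    out
}

/// Decode a varint, returning None on truncated or over-long input.
fn vint64_decode(bytes: &[u8]) -> Option<u64> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        if i >= 10 {
            return None; // a u64 never needs more than 10 bytes
        }
        value |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            return Some(value);
        }
    }
    None // ran out of bytes with the continuation bit still set
}
```

Values below 128 encode as a single byte, which is what makes delta-encoded offsets so cheap.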
Contributor

I think the explanation of the VInt64 is not exactly correct here (i.e., I don't believe it uses a "continuation" bit in each byte but rather encodes the total number of bytes using trailing zeros of the first byte).

Comment on lines +237 to +243
/// The format version, initially all zeros
///
/// If future modifications are made to this format, the version
/// should be incremented by 1. A version of 0b111 is reserved
/// for future extensions that require extending the version
/// field itself, but should be considered invalid for now.
version: [u8; 3],
Contributor

We will probably need 2 variations of the format - one for "fast" encoding/decoding and another one is for "compact" encoding (see overall review comment). So, I'm wondering if, in addition to the version field, we also need some binary flag which indicates whether this file is "fast" or "compact".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other question re version: would this implicitly encode the VM version as well? Specifically, let's say we modify the instruction set in some way - would the format version be incremented? If not, would it make sense to add something like miden_version field here (instead of it being in the package definition)?

Comment on lines +255 to +257
/// There is one byte for every 8 nodes in the set, with the node
/// count rounded up to the next multiple of 8.
roots: [u8; (self.node_count + 7) / 8],
Contributor

@bobbinth bobbinth Apr 14, 2024

I wonder if using 1 bit per node to indicate that a node is a root is a premature optimization at this point. We are basically assuming that the proportion of roots to nodes is going to be relatively high. But, if for example, we have 1000 nodes and only a dozen roots, recording them as a list of node indexes is going to be about 5x more compact (125 bytes in case of 1 bit per node vs. ~25 bytes in case of recording indexes for a dozen of nodes).

Overall, I'm not opposed to this approach, but I do wonder if we should see some evidence first that this is indeed an improvement over the simpler approach.

Comment on lines +327 to +335
#[repr(u8)]
pub enum MastNodeType {
/// For more efficient decoding, and because of the
/// frequency with which these node types appear, we
/// directly represent the child indices for `Join`,
/// `Split`, and `Loop` inline. This incurs 8 bytes
/// of padding for other node types, except `Block`
/// which takes advantage of this to also inline some
/// of its data
Contributor

@bobbinth bobbinth Apr 14, 2024

Probably a premature optimization on my part, but I wonder if we should always encode MastNodeType as 8 bytes (rather than the current 9). Basically, the first 4 bits could be used to encode the variant, and then we could still encode 2 node indexes with 30 bits each. This does cap the maximum number of nodes at $2^{30}$, but I don't think that's a particularly big concern.

On the other hand, having MastNodeType aligned on 8-byte boundaries may be beneficial.
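The packing bobbinth suggests, a 4-bit variant tag plus two 30-bit node indexes in 8 bytes, can be sketched as follows (hypothetical helpers; 4 + 30 + 30 = 64 bits exactly):

```rust
/// Pack a 4-bit variant tag and two 30-bit child indexes into one u64.
fn pack_node(tag: u8, left: u32, right: u32) -> u64 {
    assert!(tag < 16, "tag must fit in 4 bits");
    assert!(left < (1 << 30) && right < (1 << 30), "indexes are capped at 2^30");
    ((tag as u64) << 60) | ((left as u64) << 30) | (right as u64)
}

/// Recover (tag, left, right) from the packed representation.
fn unpack_node(packed: u64) -> (u8, u32, u32) {
    (
        (packed >> 60) as u8,
        ((packed >> 30) & 0x3fff_ffff) as u32,
        (packed & 0x3fff_ffff) as u32,
    )
}
```

This is what caps the node count at 2^30, as noted in the comment.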

Comment on lines +342 to +346
Block {
/// The maximum number of operations in a block
/// is always smaller than `u8::MAX`
len: u8
},
Contributor

Why is u8::MAX the maximum number of operations in a block? I don't think we have such a limit. I think here we could probably use u32::MAX.

Comment on lines +406 to +416
/// A simple operation, with its opcode encoded in a single byte
Op { opcode: u8 },
/// An operation with an immediate argument.
///
/// Like `Op`, the opcode takes a single byte, and the immediate
/// is encoded using a variable-width integer. The immediate must
/// be converted to/from `Felt` when decoded.
OpWithImmediate {
opcode: u8,
imm: VInt64,
},
Contributor

I wonder if this distinction is actually needed. By reading the opcode we could tell whether this operation has an immediate or not. Currently, the only operation which has an immediate is the PUSH operation.

Comment on lines +388 to +405
/// Unlike most other type definitions here, this one
/// is not encoded using the usual Rust layout rules.
///
/// Specifically, we treat this as a variable-width type,
/// without the padding that would normally be introduced
/// to make smaller variants as large as the largest variant.
///
/// So each encoded operation starts with the tag byte, which
/// indicates which variant we're going to be parsing. We then
/// parse the data for that variant, and immediately look for
/// the next tag byte (if there are more ops to parse).
///
/// It is strictly for purposes of readability.
///
/// NOTE: Decorators are always associated with the next non-decorator
/// operation in the basic block being decoded.
#[repr(u8)]
pub enum EncodedOperation {
Contributor

Maybe the comment explains exactly this, but I think we don't need a special enum to differentiate different types of operations and decorators. Specifically, we just need 1 byte to encode both the tag and the opcode.

This is possible because operations require only 7 bits to encode. So, we can say that if the first bit of a byte is 0, the remaining 7 bits encode an opcode; if the first bit is 1, the remaining 7 bits encode tag of a decorator.

This way, the vast majority of EncodedOperations would require 1 byte to encode (rather than 2 bytes).
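The one-byte scheme suggested here can be sketched directly (hypothetical types, not the spec's):

```rust
/// Result of decoding a single tag/opcode byte under the suggested scheme.
enum Encoded {
    Op(u8),        // high bit 0: low 7 bits are an opcode
    Decorator(u8), // high bit 1: low 7 bits are a decorator tag
}

/// Decode one byte: the high bit selects operation vs. decorator,
/// and the remaining 7 bits carry the opcode or decorator tag.
fn decode_byte(b: u8) -> Encoded {
    if b & 0x80 == 0 {
        Encoded::Op(b & 0x7f)
    } else {
        Encoded::Decorator(b & 0x7f)
    }
}
```

This works because operations need only 7 bits, leaving the high bit free as a discriminant.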

@bobbinth
Contributor

bobbinth commented Apr 15, 2024

One other thing I forgot to clarify: would things like source locations and other debug-related data go into the data section? Or these things are not really a part of this spec and would be handled separately (i.e., via the package format)?

@bobbinth
Contributor

One other thing I forgot to clarify: would things like source locations and other debug-related data go into the data section? Or these things are not really a part of this spec and would be handled separately (i.e., via the package format)?

Actually, re-reading #129, I've answered my own question: debug data goes into the package but not into MAST.

@bitwalker
Contributor Author

This has been implemented in 0xPolygonMiden/miden-vm, more or less, so I'm closing this PR since it was primarily meant for drafting the initial spec.

@bitwalker bitwalker closed this Aug 27, 2024