
RFC: Extend the behavior of load/store ops to allow addressing all elements of a word #1064

bitwalker opened this issue Sep 5, 2023 · 6 comments


bitwalker commented Sep 5, 2023

Background

I'm currently finishing up portions of the code generator in the compiler for Miden IR, and something came up as I was working through how to translate memory operations in the IR (which is byte-addressable) to Miden Assembly (which as we all know is word-addressable).

In particular, I noticed that the mem_load and mem_store instructions, which load/store an individual field element, rather than a word, only work for the first element of the word at the given address. Up until now, I had been operating under the assumption that those instructions could load/store any element of the word, but had missed that the addresses are word-aligned, thus making it (at least at first glance) impossible to represent a load for, say, the 3rd element of a word. In short, this significantly reduces the usefulness and generality of these instructions.

Proposal

I would like to suggest a change to these instructions that not only addresses this gap in functionality, but is backwards-compatible to boot.

The following are true of the current semantics:

  • The address operand for these instructions is a field element
  • The maximum address allowed is 2^32-1, which is in the u32 range. The ops in question trap if the address is out of this range.
  • If we were to u32.split the address operand, the high bits would be all zero, and the low bits would hold the address

I'm proposing the following changes:

  • The number and type of the operands remains the same
  • The address operand is treated as a pair of u32 values, and is split using the same rules as u32.split.
  • The previously unused (but zeroed) high bits will now be used to represent an offset from the base of the word whose address is held in the low bits (see the sketch below). The instructions will trap if this offset is outside the range 0-3.
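
For concreteness, here is a minimal sketch of the decoding described above (Rust, purely illustrative - this is not VM code), assuming the address operand has already been reduced to its canonical u64 representation:

// Illustrative only: split the proposed address operand into its low 32 bits
// (the word address) and high 32 bits (the element offset), mirroring u32.split,
// and trap if the offset falls outside the range 0-3.
fn decode_mem_operand(operand: u64) -> Result<(u32, u32), &'static str> {
    let word_addr = (operand & 0xFFFF_FFFF) as u32;
    let offset = (operand >> 32) as u32;
    if offset > 3 {
        return Err("invalid element offset: must be in range 0-3");
    }
    Ok((word_addr, offset))
}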

Without this change, to perform the equivalent of mem_load on the 3rd element of a word, you'd need to issue a sequence of instructions like padw, push.ADDR, mem_loadw, drop, drop, swap, drop. With this change, you would use push.ADDR, add.0x100000003, mem_load. That's, by my count (if we assume that mem_load goes from 1 cycle to 2 cycles to account for the u32 split), an improvement of 6 cycles (10 vs 4) for a basic operation that is going to occur very frequently in programs with any local or global memory allocations.

The MASM syntax for the version which accepts an immediate could be extended to accept a constant offset as well (e.g. mem_load.0x0.3), with the current syntax (e.g. mem_load.0x0) remaining equivalent to specifying a constant offset of 0.
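
On the caller side (e.g. in a compiler backend), the operand for a dynamic access could be built by packing the offset into the high 32 bits; a hypothetical helper (the name and signature are illustrative, not an existing API) might look like this:

// Hypothetical helper: combine a word address with an element offset (0-3)
// into the single felt operand expected by the proposed mem_load/mem_store.
fn element_operand(word_addr: u32, offset: u32) -> u64 {
    assert!(offset < 4, "element offset must be in range 0-3");
    ((offset as u64) << 32) | (word_addr as u64)
}

// For example, element_operand(0x0, 3) produces the same operand that the
// proposed immediate form mem_load.0x0.3 would encode.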

Potential Issue(s)

The primary issue I can think of, if I'm trying to find ways to break this, is that it becomes impossible to distinguish between an out-of-bounds address and a valid address + offset pair. The case where an out-of-bounds address appears to be an in-bounds address with an invalid offset isn't really important to distinguish, as both will trap as invalid. However, it is possible, albeit unlikely, for an out-of-bounds address to appear as a valid in-bounds address + offset pair, in which case a load or store that would have previously trapped will now read/write some arbitrary location in memory.

We could defend against this by bit-packing a 24-bit CRC code into the high bits after the offset, which must be checked when the offset is non-zero, but whether it is worth imposing such a cost on every load/store really depends on how important it is that we catch such cases. Personally, I think it would be best to specify that the behavior is undefined in such cases, just as there is little on a more typical system that protects you from writing past the end of an array - it might trap, or it might not.

Alternatives

The proposed changes are essential in my opinion, because without them it is exceedingly awkward to treat memory as word-addressable. The primary alternative is to instead treat memory as felt-addressable. There are a few problems with this:

  • The amount of memory allocated by a program which discards 75% of each word it allocates will, in the worst case, be 4x that of an equivalent program that can make full use of each word.
  • It becomes much more expensive to load values larger than a field element, requiring 2x the cycles just to perform the loads, not counting the additional ops required to compute each element's address.
  • You lose the symmetry of being able to load, operate on, and then store word-sized values using the native instructions. You only really remain unaffected with regard to the word-oriented instructions where the word is already on the operand stack, but getting words on and off the stack is much more awkward.

--

I should note that this is largely an issue which affects compilers, where fewer (or even no) assumptions can be made about where a given value is located in memory, so a consistent memory model is important. That said, even with hand-written assembly, you'd find yourself in a situation where you are either forced to waste most of a word, or required to emit expensive instruction sequences to read out a word and then extract what you need from it.


bobbinth commented Sep 7, 2023

Thank you for such a detailed proposal! Overall, I think this would add a very nice additional capability to the VM (i.e., having fully element-addressable memory could be useful in many circumstances). A few preliminary thoughts:

Implementation within the VM

I think implementing this within the VM will be quite tricky - but I also haven't spent a lot of time on this - so, maybe there are good solutions which I haven't considered yet.

The main issue is that we actually always have to read a full word from a memory address. It is just that for mem_load we put the other 3 elements into helper registers (see here). So, to select a specific element from the word, we'd need to do the following:

  1. Put all 4 elements into helper registers.
  2. Split the address into 2 parts - this would require putting the lower limb into another helper register.
  3. Represent the upper limb with 2 binary registers (e.g., [0, 0] means 0, [0, 1] means 1 etc.). Then we'd be able to use these registers as selectors for selecting a specific value.

Summarizing the above: we need 7 helper registers, but currently we have only 6. I have some thoughts on how we can increase the number of helper registers to 8, but this would require a fairly significant refactoring of the decoder.

Another potential complication is that the selectors would be of degree 2, and thus the constraint for populating the top stack slot would be of degree 3. This means that MLOAD and MSTORE operation flags cannot have degree greater than 6. This shouldn't be a problem as we currently have some empty slots for degree 5 operations, but it is still something to consider for the future.

Instruction semantics

If I understood everything correctly, the semantics you propose would be useful when addresses are static (or known at compile time). But for dynamic addresses, we'd still run into problems. For example, if we wanted to iterate over a sequence of elements in a loop, we wouldn't be able to easily increment the memory address. So, it seems to me the usefulness is relatively limited.

One other idea is to treat memory as if it was element-addressable, and then impose alignment requirements on mem_loadw and mem_storew operations. So, for example:

  • mem_load.0 would load the first element of the first word; mem_load.1 would load the second element of the first word; mem_load.4 would load the first element of the second word; etc.
  • mem_loadw.0 would load the first word, mem_loadw.4 would load the second word, but mem_loadw.1 would be illegal (see the sketch below).
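
A rough sketch of this alternative mapping (Rust, illustrative only): an element address decomposes into a word address and an index within the word, and the word-oriented ops additionally require the index to be zero:

// Illustrative only: under element-addressable memory, an address maps to
// (word address, index within word).
fn split_element_addr(addr: u32) -> (u32, u32) {
    (addr / 4, addr % 4)
}

// mem_loadw/mem_storew would additionally require word alignment, i.e. addr % 4 == 0.
fn require_word_aligned(addr: u32) -> Result<u32, &'static str> {
    if addr % 4 == 0 {
        Ok(addr / 4)
    } else {
        Err("unaligned word access")
    }
}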

I think this would be much cleaner, but there are two main downsides to this approach:

  • We still need to solve the implementation problem I described above.
  • This would probably break most programs which use memory operations.

@bitwalker

If I understood everything correctly, the semantics you propose would be useful when addresses are static (or known at compile time). But for dynamic addresses, we'd still run into problems. For example, if we wanted to iterate over a sequence of elements in a loop, we wouldn't be able to easily increment the memory address. So, it seems to me the usefulness is relatively limited.

I don't think that's necessarily true, since the element offsets for all 32-bit addresses are constant, i.e. adding 0 << 32, 1 << 32, 2 << 32, or 3 << 32 to any address gives you the "paired" address for the given element of that word.

If you wanted to traverse every element of n sequential words, completely dynamically, you'd do something like the following (using pseudocode):

let base_addr: *const Word = <the address in memory of the first word, doesn't have to be constant>;
for i in 0..n {
    let word_offset: u64 = i / 4;
    let element_offset: u64 = (i % 4) << 32;
    let word_addr: u64 = base_addr as u64 + word_offset;
    let element_addr: u64 = word_addr | element_offset;
    let element = *(element_addr as *const Element);
    // ...do stuff with element...
}

That's useful in quite a few scenarios, certainly for those in which I was intending to make use of the proposed extension.

One other idea is to treat memory as if it was element-addressable, and then impose alignment requirements on mem_loadw and mem_storew operations

I'm definitely a fan of this, as it is much cleaner right from the get-go, and is similar to how certain instructions on, say, x86-64 impose additional alignment requirements on their operands (e.g. SSE2 instructions), though as you mentioned it breaks backwards-compatibility. That said, I'm not sure to what extent the memory ops are used currently, so perhaps it isn't too catastrophic. I would think memory gets used pretty frequently, for precomputed constant tables and things like that, but I have no idea.

We still need to solve the implementation problem I described above.

Is that true if addresses now refer to elements not words? I would think in that case, there is no longer a need for an offset, leaving things unchanged at the instruction level. Is it because the alignment requirements that would be imposed on mem_loadw and mem_storew require additional helper registers, much in the same way my proposal would?

@frisitano

Having native element-addressable memory support at the VM level seems the superior option to me. There are a number of instances in the transaction kernel which would benefit from this functionality.

@bobbinth

If you wanted to traverse every element of n sequential words, completely dynamically, you'd do something like the following (using pseudocode):

I didn't write out the actual assembly code for this, but just doing a rough estimation I think we'd incur an extra 6-7 cycles per element read in this code. This is better than 9-10 cycles per read using padw, push.ADDR, mem_loadw, drop, drop, swap, drop - but it is not dramatically better. So, there is still a question in my mind whether the extra effort is worth the ~30% improvement.

I'm not sure to what extent the memory ops are used currently, so perhaps it isn't too catastrophic.

Quite a few modules currently use memory operations, but almost all of them are written (and maintained) by us, and most of them are pretty small. There are two exceptions:

  • The transaction kernel, which is fairly complex and relies heavily on memory. Memory accesses there are fairly well encapsulated, and @frisitano believes that it shouldn't be too difficult to move to the new memory model.
  • The recursive verifier, which is also fairly complex and also relies on memory quite a bit. @Al-Kindi-0 also believes that we could migrate it to the new memory model (though the effort may be more significant than with the transaction kernel).

Is that true if addresses now refer to elements not words?

No, addresses right now refer to words and we can read/write from/to memory only full words. The way we get around this for mem_load and mem_store instructions is by putting the things which don't end up on the stack into helper registers. So, we still read (or write) a full word, but part of the word goes into (or comes from) the helper registers.

As a side note, now that I look at the constraints of the mem_store operation, it seems to me that there might be a soundness bug there (i.e., a malicious prover could inject incorrect values into the temporary registers and overwrite the other 3 elements at a given address with arbitrary values). @Al-Kindi-0 - what do you think? If so, we might have to move back to the approach we used previously, when each memory row in the memory chiplet contained both the old and new values.

Coming back to the design for the new mem_load and mem_store operations, their transition diagrams could look like so:

[transition diagram for the new MLOAD operation]

[transition diagram for the new MSTORE operation]

In the above:

  • $addr$ is the element address.
  • $waddr$ is the word address computed as $waddr = \lfloor addr / 4 \rfloor$.
  • $s_0$ and $s_1$ are a binary encoding of the element index within a word. Thus, we can impose a constraint for valid address decomposition as follows: $waddr \cdot 4 + 2 \cdot s_1 + s_0 = addr$. In addition to this, we'll need constraints to enforce that $s_0$ and $s_1$ are binary, and that $addr$ is a u32 value (not sure about the best way to do this yet).
  • $val_0, ..., val_3$ are the values currently stored at $waddr$.

In the case of MLOAD, the constraint for $val$ would be:

$$ val = (1 - s_0) \cdot (1 - s_1) \cdot val_0 + s_0 \cdot (1 - s_1) \cdot val_1 + (1 - s_0) \cdot s_1 \cdot val_2 + s_0 \cdot s_1 \cdot val_3 $$
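
As a sanity check, here is a small sketch of this selection expression (Rust, with plain u64 values standing in for field elements, so it is only illustrative of the logic): with $s_0$ and $s_1$ binary, exactly one of the four products is non-zero, selecting the element at index $2 \cdot s_1 + s_0$.

// Illustrative only: evaluate the MLOAD selection expression over plain integers.
// With s0, s1 in {0, 1}, the result is val[2 * s1 + s0].
fn select(s0: u64, s1: u64, val: [u64; 4]) -> u64 {
    (1 - s0) * (1 - s1) * val[0]
        + s0 * (1 - s1) * val[1]
        + (1 - s0) * s1 * val[2]
        + s0 * s1 * val[3]
}

// e.g. select(1, 1, [10, 20, 30, 40]) == 40, and select(0, 1, [10, 20, 30, 40]) == 30.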

The constraints for MSTORE seem to be somewhat more complicated - so, I'll need to think more about them. But in either case, we need to have at least 1 extra helper register.

@bitwalker

...I think we'd incur an extra 6-7 cycles per element read in this code. This is better than 9-10 cycles per read...but it is not dramatically better. So, there is still a question in my mind whether the extra effort is worth the ~30% improvement.

I agree that the savings are relatively minimal, though I do think it's a matter of frequency - if such instructions occur frequently, the cycles saved start to add up fast for a program with any significant runtime. That said, the design here is a compromise for the sake of backwards-compatibility, and is really more about ergonomics/uniformity (whether the assembly is hand-written or compiler-generated). Overall though, I much prefer the approach of making memory element-addressable in general, since that seems like an option we can actually consider. The difference in cycle count for equivalent memory accesses between the two models would be night and day I think, since it removes a lot of complexity around accessing individual elements. It does impose a bit more overhead when accessing words, but only in situations where the alignment of the address is not known, and must be checked at runtime. In practice, accessing values smaller than a word in memory is going to be far more common than >= word-sized values (at least for code compiled through Wasm), so favoring those operations is going to have a bigger performance impact.

No, addresses right now refer to words and we can read/write from/to memory only full words. The way we get around this for mem_load and mem_store instructions is by putting the things which don't end up on the stack into helper registers. So, we still read (or write) a full word, but part of the word goes into (or comes from) the helper registers.

That makes sense, though I guess what I was wondering is whether that changes at all with element-addressable memory vs word-addressable memory. In other words, would that change just make memory look element-addressable, but preserve how memory loads/stores work in the VM today (i.e. word-oriented, with some changes to the constraints as you described)? Or would it also involve reworking the implementation of memory in the VM so that it is element-oriented rather than word-oriented?

Obviously I'm looking at this through a very narrow lens, so the importance of doing things like memory accesses in word-sized units might have far reaching implications I'm not thinking about, but in an element-addressable world, it seems out of place to implement memory accesses in terms of words, particularly if mem_load/mem_store would require multiple helper registers that go essentially unused (beyond their role in the constraint system). It would obviously still be important to enforce word-alignment for mem_loadw/mem_storew, but I'm surprised it would be necessary to have the memory itself be word-oriented, if the word-alignment restriction ensures that there is always a full word at a given address.

Anyway, I want to avoid derailing the conversation here just because I don't have a solid grasp of how some of these pieces fit together in the VM internals, so if I'm missing obvious stuff here (or, obvious if you know the internals well), feel free to ignore the above. I'm certainly curious about the relationship between the memory instructions and their implementation in the VM (and how that affects the constraints), but I imagine it's probably been discussed/documented somewhere already, and I just need to go do some reading.

@bobbinth

That makes sense, though I guess what I was wondering is whether that changes at all with element-addressable memory vs word-addressable memory. In other words, would that change just make memory look element-addressable, but preserve how memory loads/stores work in the VM today (i.e. word-oriented, with some changes to the constraints as you described)? Or would it also involve reworking the implementation of memory in the VM so that it is element-oriented rather than word-oriented?

The approach I described above would still keep the internal mechanics of memory word-oriented. So, basically, it would just look element-addressable.

Obviously I'm looking at this through a very narrow lens, so the importance of doing things like memory accesses in word-sized units might have far reaching implications I'm not thinking about, but in an element-addressable world, it seems out of place to implement memory accesses in terms of words, particularly if mem_load/mem_store would require multiple helper registers that go essentially unused (beyond their role in the constraint system). It would obviously still be important to enforce word-alignment for mem_loadw/mem_storew, but I'm surprised it would be necessary to have the memory itself be word-oriented, if the word-alignment restriction ensures that there is always a full word at a given address.

The word-oriented design is done primarily for efficiency reasons. For the use cases we are targeting, it will be very common to read/write memory as full words rather than individual elements. For example, nodes in a Merkle tree are words, output of the RPO hash function is a word (and inputs are 3 words), assets in Miden rollup are words etc. etc. So, we frequently have a need to read or write a full word (or even two consecutive words), and I wanted to make this as efficient as possible.

Consider mem_stream, for example. This instruction reads two consecutive words and overwrites the top 8 elements of the stack with the result in a single cycle. Implementing this instruction with element-addressable memory would not be possible in our current design (or we'd need to add extra columns, which would slow down the rest of the VM). And emulating this instruction with individual loads would probably require ~30 cycles.
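
For intuition, here is a rough conceptual model of what mem_stream does against a word-oriented memory (Rust, illustrative only - not how the VM actually implements it):

type Word = [u64; 4];

// Conceptual model only: read two consecutive words starting at `word_addr` and
// produce the 8 elements that replace the top of the stack, as one operation.
fn mem_stream(memory: &[Word], word_addr: usize) -> [u64; 8] {
    let (w0, w1) = (memory[word_addr], memory[word_addr + 1]);
    let mut out = [0u64; 8];
    out[..4].copy_from_slice(&w0);
    out[4..].copy_from_slice(&w1);
    out
}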
