Support WebAssembly memory instructions on big-endian platforms #2124

Closed
uweigand opened this issue Aug 10, 2020 · 10 comments

@uweigand
Member

Feature

Currently, wasmtime maps the WebAssembly memory instructions (t.load, t.store etc.) directly to Cranelift IR memory instructions (load, store, uloadN, etc.).

This causes problems on big-endian platforms, because the Cranelift IR instructions are implemented as native load and store instructions using the machine byte order, while the WebAssembly memory instructions are specified to always use little-endian byte order.

Now, I initially thought that one way to solve this problem could be to treat Cranelift IR memory instructions also as always little-endian by specification. However, that does not work, because there are many other uses of these instructions that do require native byte order.

Some examples of those include:

  • Memory accesses added by platform ABI code (implicit pointers for argument or return values), in particular if this needs to be compatible with native code.

  • Memory accesses to values prepared by trampoline code at the boundaries of VM native code and JITted code.

  • Memory accesses to parts of the VMContext that are also accessed by VM native code.

In addition, there are cases where, while not strictly necessary for correctness, it is preferable for performance reasons to use native byte order, e.g. for spill code, for accessing variables on the stack, or when implementing code such as inlined copies for small memcpys.

So, I believe we need some way of representing both always-little-endian memory operations (used to translate the WebAssembly instructions) and native-byte-order memory operations (used for everything else).
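
To make the problem concrete, here is a minimal standalone Rust illustration (standard library only, no Cranelift APIs) of how the same bytes decode to different values depending on the byte order used to read them:

```rust
// Standalone illustration (standard library only): the same four bytes decode
// to different values depending on the byte order used to read them.
fn main() {
    let bytes = [0x01u8, 0x02, 0x03, 0x04];

    // WebAssembly's i32.load is specified to read these bytes little-endian:
    let wasm_view = u32::from_le_bytes(bytes);
    assert_eq!(wasm_view, 0x0403_0201);

    // A plain native load on a big-endian host (e.g. IBM Z) reads them big-endian:
    let native_be_view = u32::from_be_bytes(bytes);
    assert_eq!(native_be_view, 0x0102_0304);

    // Mapping t.load directly to a native load is therefore only correct on
    // little-endian hosts.
    println!("LE view: {wasm_view:#010x}, BE view: {native_be_view:#010x}");
}
```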

Benefit

Enabling support for Wasmtime on big-endian platforms like IBM Z.

Implementation

My current implementation of this approach simply duplicates all Cranelift IR memory instructions to create always-LE versions. So in addition to "load" there is "load_le" etc. The full list is:
load_le
load_le_complex
store_le
store_le_complex
uload16_le
uload16_le_complex
sload16_le
sload16_le_complex
istore16_le
istore16_le_complex
uload32_le
uload32_le_complex
sload32_le
sload32_le_complex
istore32_le
istore32_le_complex

Advantages of this approach include:

  • Most code that creates load/store instructions can remain unchanged; the WebAssembly translator simply always uses the new instructions.
  • It's already implemented and working :-)

But there are disadvantages:

  • All back-ends must implement all the new instructions (usually by just mapping them back to normal loads/stores), or else the target will stop working.
  • Middle-end code operating on loads/stores (e.g. the code that recognizes and creates _complex operations) needs to be adapted, or else we can get performance regressions.

Alternatives

There are various alternative ways this could be implemented:

A) Add an additional flag argument to load/store instructions that specifies the requested byte order. A detail question is whether the flag should distinguish:

  • little-endian vs. native
  • little-endian vs. big-endian
  • little-endian vs. big-endian vs. native

Advantages:

  • no new IR instructions required
  • existing back-ends could simply ignore the flag

Disadvantages:

  • it's still an IR change as that flag must be considered part of the IR (e.g. parsing IR, serialization ...)
  • All creators of loads/stores (including outside of cranelift, and possibly even outside of wasmtime!) must be updated. If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.

B) Add an additional flag bit to the existing MemFlags
Advantages:

  • no new IR (should be covered by existing serialization ...)
  • can be ignored by existing back-ends
  • no change to (most) creators of loads/stores necessary

Disadvantages:

  • MemFlags can no longer be dropped; it becomes required for correctness to always preserve it in the middle end

C) Open-code the conversion in the WebAssembly translator
Only emit a plain "load" if the target is little-endian. Otherwise emit a load followed by a byte-swap (possibly followed by an extension). Vice versa for stores. This would probably require the addition of a new "bswap" Cranelift IR instruction, unless we want to open-code bswap itself as well (possible, but a bit tedious).
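
A rough sketch of what this could look like in the translator, assuming the new bswap instruction is added; this is modeled on the cranelift-frontend builder API for illustration, not taken from an actual patch:

```rust
// Illustrative sketch of option (C), assuming a new `bswap` instruction is
// added to Cranelift IR; modeled on the cranelift-frontend builder API.
use cranelift_codegen::ir::immediates::Offset32;
use cranelift_codegen::ir::{types, MemFlags, Value};
use cranelift_frontend::FunctionBuilder;

fn translate_i32_load(
    builder: &mut FunctionBuilder<'_>,
    addr: Value,
    offset: i32,
    target_is_little_endian: bool,
) -> Value {
    // Plain load in the machine's native byte order.
    let flags = MemFlags::new();
    let loaded = builder
        .ins()
        .load(types::I32, flags, addr, Offset32::new(offset));
    if target_is_little_endian {
        // The native load already matches WebAssembly's little-endian semantics.
        loaded
    } else {
        // On a big-endian target, byte-swap the loaded value to recover the
        // little-endian interpretation that WebAssembly requires.
        builder.ins().bswap(loaded)
    }
}
```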

Advantages:

  • Only "bswap" as new IR element, can be ignored by back-ends for little-endian architectures and everywhere else.

Disadvantages:

  • No major ones I can see - this would be my preferred approach.
@uweigand
Member Author

@cfallin @julian-seward1 - opened issue as discussed in our call today.

@uweigand
Member Author

uweigand commented Nov 9, 2020

See the PR I just posted. This is not necessarily ready to go in as-is, it is intended to showcase a possible implementation.

The PR implements a variant of method (C) above, where I'm creating just two new Cranelift IR opcodes, LoadRev and StoreRev - load reverse and store reverse. (These map directly to instructions IBM Z has, and a number of other platforms have as well.)

Note that there are no extending load / truncating store versions of those (I'm not aware of any ISA that directly has them, and in any case any new back-end could just match the combination if necessary). Likewise, there are no _complex versions; those are no longer really used with the new back-ends anyway.

@fitzgen
Member

fitzgen commented Nov 30, 2020

Thanks for exploring the design space and laying out our choices @uweigand.

I would personally be in favor of alternative A. I think having clif's semantics defined irrespective of the current target is valuable and will make hacking on the compiler easier.

(I think B is also a strong contender, but A will be easier from an engineering perspective, since we can leverage compiler errors to automatically tell us exactly which code needs updates, rather than manually hunting down places where mem flags are silently dropped and/or created with defaults.)

Although I don't strongly hold this opinion, I'm leaning towards only discriminating between big and little endian, not including a native endian option.

If there is no "native" flag setting, all those updates must include finding out the native byte order somehow.

We can expose the target's endianness relatively easily, since our target-lexicon crate already has this information, and the cranelift-frontend builders can implement a load-with-native-endian builder method.

https://docs.rs/target-lexicon/0.11.1/target_lexicon/enum.Endianness.html
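
For illustration, a minimal sketch of querying the target's endianness through target-lexicon (exact integration into the cranelift-frontend builders would of course look different):

```rust
// Minimal sketch of querying the native byte order via target-lexicon; the
// Triple::endianness() accessor is from the crate linked above.
use target_lexicon::{Endianness, Triple};

fn native_endianness(triple: &Triple) -> Endianness {
    // endianness() can fail for architectures without a defined byte order;
    // defaulting to little-endian here is only for this sketch.
    triple.endianness().unwrap_or(Endianness::Little)
}

fn main() {
    let host = Triple::host();
    println!("host endianness: {:?}", native_endianness(&host));
}
```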

@cfallin
Member

cfallin commented Nov 30, 2020

+1 to the combination of two choices:

  • Only "big endian" and "little endian" as choices (no "native");
  • Make the current load-instruction builders choose the target's native endian.

In other words we want existing users of the builder API to continue to see the same semantics, and new builder methods can take an Endianness. Removing the "native" option at the CLIF level fully closes the target-specific semantics hole, which is nice!

@uweigand
Member Author

uweigand commented Dec 2, 2020

I've made an initial attempt at implementing something along those lines here: https://github.com/uweigand/wasmtime/pull/1

Note that this is currently not ready to even attempt merging upstream, in particular because the addition of the extra endianness member to various clif instructions makes the InstructionData structure exceed its target size of 16 bytes.

So either we find some way to encode InstructionData more efficiently (but that would need advice from somebody much more familiar with Rust than I am ...), we accept the (significantly increased) memory consumption, or else we go back and try something else.

However, adding an explicit endianness member also makes the code more verbose; if you look at the patch above, you'll see a large number of mostly "boilerplate" changes. As an aside, I have not been able to implement this suggestion:

Make the current load-instruction builders choose the target's native endian.

as is (or maybe I misunderstood it). The builders that are used by most of the code are the ones automatically generated by the "meta" machinery, and not only did I not find an easy way to provide a default value for any member, but the builders don't seem to even have the required information (they don't have access to the TargetIsa or triple or anything else to determine the default target endianness, unless I'm missing something). That said, changing the callers to provide an explicit endianness (usually retrieved from the TargetIsa) was mostly straightforward.

As a result of those issues with adding an explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness in the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptical about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus it cannot be silently dropped.
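
As a rough illustration of that idea (hypothetical types and fields, not the actual cranelift API), the constructors would look something like:

```rust
// Hypothetical sketch of making the endianness argument mandatory on the
// MemFlags constructors; field names are illustrative only.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Endianness {
    Little,
    Big,
}

#[derive(Clone, Copy)]
struct MemFlags {
    endianness: Endianness,
    notrap: bool,
    aligned: bool,
}

impl MemFlags {
    // Every construction path takes the endianness up front, so it can never
    // be forgotten or silently dropped when flags are created afresh.
    fn new(endianness: Endianness) -> Self {
        MemFlags { endianness, notrap: false, aligned: false }
    }

    fn trusted(endianness: Endianness) -> Self {
        MemFlags { endianness, notrap: true, aligned: true }
    }
}
```

A caller that wants native byte order would then query the TargetIsa (or target-lexicon) once and pass the result in.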

I'll see if I can come up with another implementation along those lines.

Finally, I'd appreciate any further comments on the patch above. In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

  • When dumping an instruction for debug purposes, I'm now always including endian information (by adding "little" or "big" to the output, alongside where MemFlags are printed).
  • When reading an instruction (e.g. for clif-util testing), the parser will accept "little" or "big" as well. If neither is present, the endianness will default to that of the ISA currently in operation (similarly to how register names are parsed); if no ISA is currently in operation, the parser will reject any memory instruction without an explicit "little" or "big". (This happens just once in the existing filetests.)

@fitzgen
Member

fitzgen commented Dec 2, 2020

As a result of those issues with adding an explicit endianness member, I've been thinking of maybe going back to option B and encoding endianness in the MemFlags. This solves the InstructionData size problem, and also removes most of the boilerplate code changes. I was originally sceptical about option B due to the potential issue of introducing errors by accidentally dropping MemFlags during optimizations. However, I think this problem can be fixed: all builders of load/store instructions already make it mandatory to provide a MemFlags value. This can be either copied from another instruction, or else created afresh by starting from MemFlags::new() or MemFlags::trusted(). Now, if we simply added a mandatory endianness argument to all such MemFlags constructors, then we could easily enforce that every copy of MemFlags always contains target endianness information, and thus it cannot be silently dropped.

This sounds very reasonable to me.

In particular, there are some user-visible changes to the textual representation of clif instructions - please let me know if this approach looks good:

This sounds good to me.

@uweigand
Member Author

uweigand commented Dec 4, 2020

Here's an updated patch set along those lines: https://github.com/uweigand/wasmtime/pull/2

I'm still not completely happy, in particular because of the following two issues:

  • New-style back-ends now attach MemFlags to machine instructions, where they are used only to decide whether the emitted instruction can trap or not. Having endianness encoded into MemFlags can make this more tedious, in particular with test cases that create new MemFlags (e.g. emit_tests.rs). At this point, there's usually no TargetIsa available so the endianness needs to be hardcoded. (In any case it is later ignored anyway.) @cfallin you added this recently -- would it maybe be preferable to just attach a may_trap flag instead?

  • GlobalValue "load" instructions use somelike similar, but not quite identical to MemFlags; it is printed and parsed as if it were MemFlags, but in the data structure there's actually just a "readonly" flag. Should this now include endianness in printing and parsing? Should the data structure transition to a full MemFlags instead?

Comments on these questions, as well as general comments on the approach, are welcome - thanks!

@cfallin
Member

cfallin commented Dec 4, 2020

At this point, there's usually no TargetIsa available so the endianness needs to be hardcoded. (In any case it is later ignored anyway.) @cfallin you added this recently -- would it maybe be preferable to just attach a may_trap flag instead?

Hmm, yeah, I think we may want to reconsider this. I had refactored things away from attaching a SourceLoc to every may-trap memory instruction explicitly, as that was becoming untenable, and plumbing the flags through was simple enough; but considering the bits that actually exist in the MemFlags, the trap-ness is the only thing that wouldn't otherwise translate to a machine-specific thing (e.g. alignedness should manifest in how the lowering chooses instructions, if at all, not in flags on one machine instruction).

In particular, the endianness flag at the CLIF level should translate to a different instruction sequence at the VCode level on most machines (excepting only a hypothetical machine that has a "swap bytes" bit on every load instruction). So it would be suboptimal to be able to represent meaningless flags at the VCode level.

An alternative we had considered was to instead make the trap-ness part of a new "instruction metadata", along with the safepoint bit, that lives alongside the MachInsts in the body rather than inside of them. I'm starting to warm up to this more; trap-ness is analogous to safepoint-ness in that both result in metadata attached to the MachBuffer during emission, and both are properties of the instruction that are known when it is given to the LowerCtx.

So I might suggest the following: (i) create a MachInstMetadata that has, for now, two flags (safepoint, may_trap); (ii) unify LowerCtx::emit() and LowerCtx::emit_safepoint() into emit_with_opts() and emit(), the latter forwarding to the former with MachInstMetadata::default(); (iii) store the metadata alongside instructions; (iv) feed the metadata to EmitState where we currently attach the stackmap on safepoints.
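
A minimal sketch of what that could look like (hypothetical names; this is a proposal, not code that exists in cranelift as written here):

```rust
// Hypothetical sketch of the MachInstMetadata refactor described above.
#[derive(Clone, Copy, Default)]
struct MachInstMetadata {
    // Instruction is a GC safepoint and needs a stackmap attached at emission.
    safepoint: bool,
    // Instruction may trap and needs trap metadata recorded in the MachBuffer.
    may_trap: bool,
}

trait LowerCtxSketch {
    type Inst;

    // The general entry point carries explicit per-instruction metadata.
    fn emit_with_opts(&mut self, inst: Self::Inst, meta: MachInstMetadata);

    // The common case forwards default (empty) metadata.
    fn emit(&mut self, inst: Self::Inst) {
        self.emit_with_opts(inst, MachInstMetadata::default());
    }
}
```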

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

@uweigand
Member Author

uweigand commented Dec 8, 2020

That's a large enough refactor that, for now, I think your PR's approach is fine (keep MemFlags on instructions, add a bit of verbosity to emission tests); but eventually we can clean things up.

GlobalValue "load" instructions

Yes, these should also carry MemFlags, I suspect; and since the lowering happens during CLIF legalization, we can just pass the flags through.

Thanks @cfallin, I've now updated the patch to carry a MemFlags in GlobalValueData::Load and left the MemFlags on machine instructions for now. The updated patch is now posted as PR #2488 above.

@cfallin
Member

cfallin commented Dec 14, 2020

Fixed with #2492.

@cfallin cfallin closed this as completed Dec 14, 2020
jlb6740 pushed a commit to jlb6740/wasmtime that referenced this issue Dec 15, 2020
WebAssembly memory operations are by definition little-endian even on
big-endian target platforms.  However, other memory accesses will require
native target endianness (e.g. to access parts of the VMContext that are
also accessed by VM native code).  This means on big-endian targets,
the code generator will have to handle both little- and big-endian
memory accesses.  However, there is currently no way to encode that
distinction into the Cranelift IR that describes memory accesses.

This patch provides such a way by adding an (optional) explicit
endianness marker to an instance of MemFlags.  Since each Cranelift IR
instruction that describes memory accesses already has an instance of
MemFlags attached, this can now be used to provide endianness
information.

Note that by default, memory accesses will continue to use the native
target ISA endianness.  To override this to specify an explicit
endianness, a MemFlags value that was built using the set_endianness
routine must be used.  This patch does so for accesses that implement
WebAssembly memory operations.

This patch addresses issue bytecodealliance#2124.
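
For illustration, a minimal sketch of how a WebAssembly heap access would request explicit little-endian byte order under this design, based on the set_endianness routine named in the commit message above (exact signatures may differ):

```rust
// Sketch of building MemFlags for a WebAssembly heap access; uses the
// set_endianness routine named in the commit message above.
use cranelift_codegen::ir::{Endianness, MemFlags};

fn wasm_heap_access_flags() -> MemFlags {
    let mut flags = MemFlags::new();
    // Without this call the access keeps the default, i.e. the native byte
    // order of the target ISA; WebAssembly loads/stores override it.
    flags.set_endianness(Endianness::Little);
    flags
}
```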