Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for stack switching proposal #10177

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

frank-emrich
Copy link
Contributor

@frank-emrich frank-emrich commented Feb 3, 2025

This PR adds support for the Wasm stack switching proposal. The explainer document for the proposal is here.

This means that the following instructions are added: cont.new, resume, suspend, switch, and cont.bind.
The instruction resume.throw, which is part of the proposal but relies on exception handling, is currently unsupported.

This was developed together with @dhil.

General implementation notes

In Wasm, continuations are represented by values of type (ref $ct), where $ct is a new composite type/heap type for continuations.
In the implementation, these are represented by values of type VMContObj. These are fat pointers, consisting of a sequence counter, and a pointer to a VMContRef. The latter type is used for the actual representation of continuations.
The sequence counter part of VMContObjs is used to check that every continuation value can only be used once.

The StoreOpaque is extended to contain a "stack chain": It indicates what stack we are currently executing on. Logically, this is a linked list of stacks, since each continuation has a parent field. The chain stored in the StoreOpaque always ends with a value representing the initial stack. This is the stack we were on when entering Wasm, which will usually be the main stack.

Memory Management

Currently, memory management is very basic: The StoreOpaque provides a new method for allocation a new VMContRef, and keeps track of all continuations created this way. Continuations are never deallocated at the moment, meaning that they live until the store itself is deallocated.

The stack memory used by each allocation (represented by type ContinuationStack) is always mmap-ed, with a guard page at the end. There is currently no support for growing stacks, or segmented stacks.

Backtrace generation

The existing backtrace generation mechanism is extended to be fully aware of continuations: When generating backtraces, the entire chain of continuation is traversed, not just the stack frames of the currently active stack/continuation.

Integration with GC

Integration with the GC proposal is limited, but should at least be safe:

  • Programs where continuation objects would be stored inside GC objects (structs, arrays, ...) are rejected
  • During GC, all stack frames of all continuations are inspected when looking for GC roots.

Entering/Exiting Wasm

Prior to this PR, there were two separate mechanisms that save and restore some state of the runtime when entering/exiting Wasm:

  1. The functions enter_wasm and exit_wasm in func.rs
  2. CallThreadState saves and restores (on drop) parts of the VMRuntimeLimits

This PR consolidates these two mechanism, because it requires some additional state to be updated and restored on enter/exit:
the type wasmtime::runtime::func::RuntimeEntryState now stores all of the required runtime state and ensures that it's restored when exiting Wasm.

Tags

Since the stack switching proposal extends the notion of tags introduced by the exception handling proposal, it adds the necessary machinery for importing, exporting, and defining tags.

Limitations

The limitations are as follows, some of which were mentioned above:

  • resume.throw needs to be implemented once exception handling is there
  • Full integration with GC and support for deallocating continuations
  • The only supported platform at the moment are unix-like x64 systems. I'm planning to change that soon as follow-up PRs, this won't require big changes to what's in this PR at the moment.
  • Neither Winch or Pulley are supported at the moment.

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen wasmtime:c-api Issues pertaining to the C API. wasmtime:ref-types Issues related to reference types and GC in Wasmtime labels Feb 3, 2025
Copy link

github-actions bot commented Feb 3, 2025

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "wasmtime:c-api", "wasmtime:ref-types"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: wasmtime:ref-types

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@frank-emrich
Copy link
Contributor Author

frank-emrich commented Feb 4, 2025

There are currently a few test failures that I 'm hoping to get some help with:

  • I don't understand cargo vet well enough to know what's up, but the problem should be fairly benign: We are using memoffset in a different crate now, but from what I can tell, it has already been vetted for use in this project.
  • There are some build failures due to the new instructions not being supported on all platforms. In general, we've added a Wasm feature and a cargo feature for stack switching. My plan was that all stack switching tests would be skipped if the stack-switching cargo feature is not enabled. That's easy to do for the all test suite, but I'm not sure what the best approach is for the wast tests to achieve this. Am I understanding correctly that the wast test suite always currently runs all tests, independently from what cargo features are enabled?

As a general note, so far most of the stack switching code is compiled even if the cargo feature is disabled. It would be possible to get more aggressive with conditional compilation, but this requires some ugly special casing in backtrace.rs. Happy for suggestions if people would like to see more conditional compilation or are happy with keeping it minimal.

Also, I had to re-bless most of the disas tests since the layout of the VMContext changed due to the addition of some new fields. That has increased the surface area of this PR even more.

@fitzgen
Copy link
Member

fitzgen commented Feb 5, 2025

  • Am I understanding correctly that the wast test suite always currently runs all tests, independently from what cargo features are enabled?

That is correct, and we assert that tests that "should" fail do in fact fail. This way, if we fix bugs or expand support, we won't accidentally and silently miss out on tests for those things.

This should be fine for your stack switching tests, as long as you

  • mark the tests as "should fail" on supported platforms, and
  • return an error or some sort (rather than unsafely segfault or whatever) when processing Wasm that uses this proposal on an unsupported platform/configuration.

See

pub fn should_fail(&self, config: &TestConfig) -> bool {

@frank-emrich frank-emrich marked this pull request as ready for review February 12, 2025 18:07
@frank-emrich frank-emrich requested review from a team as code owners February 12, 2025 18:07
@frank-emrich frank-emrich requested review from alexcrichton and fitzgen and removed request for a team February 12, 2025 18:07
@alexcrichton
Copy link
Member

Oh sorry I should have left a note here but I think I forgot. I was out last week for a company offsite and I'm out this week for the CG meeting, but I plan on getting to this next week. I want to make sure I've got plenty of time to sit down and go through this, but wanted to give a heads-up that I'll be looking into it next week at the earliest.

Also to reiterate: thank you so much for this! Thanks for wrestling with CI as well, much appreciated!

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think I've seen about half of the files of this PR (excluding auto-updates in tests) so wanted to leave an initial round of comments/thoughts.

At a high-level everything here looks reasonable to me so far. I haven't dug into the actual implementation details of stack switching though and have for this been focusing on the integration points with the rest of Wasmtime. I left a lot of comments along the lines of "file an issue to point a comment to this" and I think it might make the most sense to have a single tracking issue for follow-up work on stack-switching. That can have a lot of various checkboxes for the logical work items remaining (e.g. contref in the gc heap, in the embedding API, windows support, etc).

Additionally at a high-level I want to also ask you what your own time allowance is for landing this? This is, as expected, a large PR which is going to take some time to land. I suspect it'll take me on the order of ~days of review to get through the the real "meat" of stack switching and I'll likely have comments from all that as well. If, however, you're running short on time for this that's ok too!

Basically what I want to ask is: do you have time allocated for the back-and-forth of review here? It's hard to predict exactly how long it will take but my gut is that it'll be a nontrivial amount of time. If you don't have time for this I think that's also totally ok. I'll probably still continue to review all this and I'll try to find time in the coming future to work on addressing many of these items myself.

In an ideal world this PR would be split up into chunks and landed piecemeal, for example parsing support, then compiling support, then maybe an instruction-at-a-time for each piece, etc. Splitting this up is no easy task though so I don't want to ask this lightly of you, and you also know much better than I about if it even could be split up and where it could be split. Part of this is that GitHub does not make reviewing large PRs easy (e.g. every time I add a comment it takes ~5 seconds for the comment box to appear), and overall it's easier to focus on smaller changes to ensure that as little as possible slips through the cracks.

Basically tl;dr; I've left a lot of comments below but it doesn't have to be you alone addressing them. If you're running low on time to allocate to this work I think we can try to be accomodating in assisting to landing this.

@@ -20,6 +20,7 @@ pub extern "C" fn wasm_extern_kind(e: &wasm_extern_t) -> wasm_externkind_t {
Extern::Global(_) => crate::WASM_EXTERN_GLOBAL,
Extern::Table(_) => crate::WASM_EXTERN_TABLE,
Extern::Memory(_) => crate::WASM_EXTERN_MEMORY,
Extern::Tag(_) => todo!(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind filing an issue about this and tagging with with // FIXME(#NNN): ...

Comment on lines +1019 to +1010
// FIXME(frank-emrich): If function-references depends on the gc
// feature, and stack-switching requires function-references, that
// probably means that stack-switching should depend on gc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to leave this as-is, the #[cfg] here is only to handle the conditionally defined functions on wasmtime::Config but it looks like wasm_stack_switching is unconditionally defined so it's ok to unconditionally thread through the boolean to there here. The wasmtime::Config can then handle validation of ensuring various dependent proposals are all enabled.

// Unsupported proposals. Note that other proposals have partial
// support at this time (pulley is a work-in-progress) and so
// individual tests are listed below as "should fail" even if
// they're not covered in this list.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is a copy/paste from a historical version, mind updating it to just say that pulley doesn't support stack-switching/exceptions at this time?

}
}

// FIXME(frank-emrich) Justify why this is afe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy/pasting the other unsafe impl VmSafe comments should be sufficient here, basically it's #[repr(C)] and the internals are all VmSafe

Comment on lines 266 to 274
/// The fields compiled code needs to access to utilize a WebAssembly
/// tag imported from another instance.
#[derive(Debug, Copy, Clone)]
#[repr(C)]
pub struct VMTagImport {
/// A pointer to the imported tag description.
pub from: VmPtr<VMTagDefinition>,
/// A pointer to the owning instance.
pub vmctx: VmPtr<VMContext>,
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that tags are "just" a type, while this follows the pattern of other imports I think this might be able to be deleted to use VMTagDefinition instead? Or basically otherwise only contain a VMSharedTypeIndex. Although I haven't reviewed the rest of this PR yet so I'm not sure if the pointers here are necessary for something else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tag is more than a wasm type. It is a nominal entity and its unique identity is important. Basically we get a unique identity by picking backing on pointer equality.

I think we can get rid of the vmctx, potentially. Though we can discuss the best course of action here. I have pulled the tag stuff into its own patch which I will open tomorrow when I have had a chance to look at the tests.

Comment on lines +133 to +135
HeapType::NoCont | HeapType::ConcreteCont(_) | HeapType::Cont => todo!(
"Embedder support for continuations has not yet been implemented!"
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This location unfortunately can't easily be switched to returning a Result so panicking makes sense to me, but mind tagging this with // FIXME(...) pointing to an issue?

@@ -188,6 +188,8 @@ impl Table {
ty => unreachable!("not a top type: {ty:?}"),
}
}

runtime::TableElement::ContRef(_c) => todo!(), // TODO(dhil): Required for the embedder API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like above for globals the panic here is fine for now but this'd be good to tag with a // FIXME pointing to an issue

Comment on lines 29 to 37
pub(crate) fn ty(&self, _store: impl AsContext) -> TagType {
todo!()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just an internal method it might make sense to delete this and update callers to not use it?

Comment on lines +283 to +285
HeapType::NoCont | HeapType::ConcreteCont(_) | HeapType::Cont => {
unimplemented!()
} // TODO(dhil): Need to do this for the embedder API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind tagging this with a // FIXME pointing to an issue?

@frank-emrich
Copy link
Contributor Author

frank-emrich commented Feb 17, 2025

@alexcrichton

Thanks so much for having a look at this already!

Regarding my own time commitment for this PR: I'm starting a new job in mid March, so things are looking as follows:

  • Until end of February: Focusing exclusively on getting this PR in shape
  • First two weeks of March: Working on this PR, but some distractions due to moving, etc.
  • From mid March: There will be an initial period of me settling in at my new job during which I can't do much on this. After that the Wasmtime/stack switching work will not part of my day job anymore, but I'll still be around to help out, address bugs, etc. I also have a few things on top of the current PR already done or planned that I want to add at that stage (aarch64 support, Windows support, better DWARF support, ....).

@dhil should also be able to help out here and there.

In an ideal world this PR would be split up into chunks and landed piecemeal, for example parsing support, then compiling support, then maybe an instruction-at-a-time for each piece, etc.

I think there was just a bit of a miscommunication between me and @dhil. I thought you wanted to see this as one big chunk. There's a simple way in which this PR could be split up, leading to a sequence of PRs as follows:

  1. Integrate tag support, with nothing stack switching related, yet.
  2. Add dummy field to VMContext with the size and position that the stack switching field will later occupy. This way, all the changes to the disas tests will happen in this PR. I'm not sure if the disas tests actually contribute to Github choking on this PR, but at least getting them out of the way will reduce the amount of conflicts coming in.
  3. Everything except the actual Wasm -> CLIF translation. This will be a big chunk where most of the integration of stack switching with the rest of Wasmtime goes.
  4. The Wasm -> CLIF translation. This will mostly be what's currently in crates/cranelift/src/stack_switching/, in particular instructions.rs. The latter is 2500 LOC, so reviewing that separately seems worthwhile. I may even be able to split this up further.

This is far from perfect because PR 3 and 4 will still be large and not really split along separate features of the stack switching implementation, but at least that split up would be easy for me to do. I may discover some more chunks that can be factored out into separate PRs on the way.

My only concerns about this is the following: Given that Github doesn't really support stacked PRs, we would work through these 4+ PRs linearly, making it difficult to point at why we need something in the next PR.
On the other hand, I could just put a branch somewhere that shows everything that's still missing at any point so you can have a look at how something will be used in the remaining PRs.

Please let me know what you think. I will also make sure to work through the comments you already added to this PR so they don't get lost if we do decide to split this PR.

Edit: One concern I forgot to mention regarding splitting this PR up is testing: We only have a few scattered unit tests, most of our testing happens through whole Wasm programs. That means that if we split the PR, we will only be able to add tests in the end, when all pieces are in place.

@alexcrichton
Copy link
Member

Regarding my own time commitment for this PR

Ok that all sounds good, thanks! (very helpful to set expectations!)

I think there was just a bit of a miscommunication between me and @dhil.

Oh no that's on me. I remember telling @dhil that and I think I was just basically wrong, so that's on me for leading y'all astray. Given the size of this (not just the auto-updated tests but also the scale of the implementation) along with the subtelty of the implementation in retrospect I just didn't account for that and misled y'all. If possible I think the best way to handle this would be to peel off PRs from this one and submit them independently.

Overall my goal is to ensure that we get a good chance to review everything coming in and reducing the chance of something being lost by accident due to being overlooked. For coordination what I would imagine is that it's completely reasonable to have lightly, if at all, tested intermediate states. Additionally it's totally reasonable to leave a "hole" in the code with something like // more coming soon ... (e.g. making a hole in the VMContext for the stack-switching state necessary for a future PR). Let's keep this PR open so it can be cross-referenced if needed, and this can be rebased occasionally as smaller pieces are landed.

Does that seem ok? It's a similar strategy we're taking for landing async component model support in Wasmtime where it's landing piece-by-piece. Each piece is not 100% thoroughly tested but the "final PR" has lots of tests. We try to keep track of outstanding TODO items in issues occasionally and otherwise are ok deferring work to a future PR which has more tests and more fully justifies a prior design.

Your sequence of PRs makes sense to me as well. I'd mostly defer to y'all for pieces to split up as you're the most familiar with the implementation, and if something seems too big I can bring it up during the review but by naturally splitting this up I think most slicing/dicing will be quite reasonable.

@frank-emrich
Copy link
Contributor Author

Overall my goal is to ensure that we get a good chance to review everything coming in and reducing the chance of something being lost by accident due to being overlooked.

Yes, absolutely. I'm happy to do the necessary chopping. I'll create a tracking issue describing the bigger picture and where we currently are regarding the PRs that have and haven't landed at any time.

The outline of PRs I mentioned in my previous comment will still lead to fairly large ones, but I will see if there's potential to split this up further and wait for your feedback once I've split off a candidate for a PR.

Also, sorry for making you wade through lots of todo!(), unimplemented!() and other little things that we hadn't resolved before creating this PR.

@alexcrichton
Copy link
Member

Sounds good! It's ok to skip the step of writing out a plan in an issue and then executing on the plan, we can work on making sure things are documented before you wind down but up until then it's ok if it's just in a few heads. In that sense no need to exhaustively document "here's the round peg that's gonna fit in that round hole" and we can just have one issue for tracking known items that are already in-tree.

And please no apologies necessary! Landing big changes I've found is always a bit of an art and if anything I'm the one who led you astray by being overconfident!

@github-actions github-actions bot added the fuzzing Issues related to our fuzzing infrastructure label Feb 19, 2025
Copy link

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "fuzzing"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: fuzzing

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@frank-emrich
Copy link
Contributor Author

Converting this back to a draft now because it will be split up into multiple PRs.

Over the next few weeks, I expect to rebase this PR whenever parts of it have landed. This way, this PR will always contain whatever is left.

I've also created issue #10248 to track the overall status. I will also use that issue in various FIXMEs for missing features.

@frank-emrich frank-emrich marked this pull request as draft February 19, 2025 23:17
This was referenced Feb 20, 2025
@frank-emrich frank-emrich force-pushed the stack-switching branch 2 times, most recently from dd3f75a to 52b8759 Compare February 20, 2025 23:07
@github-actions github-actions bot added wasmtime:api Related to the API of the `wasmtime` crate itself wasmtime:config Issues related to the configuration of Wasmtime labels Feb 21, 2025
Copy link

Label Messager: wasmtime:config

It looks like you are changing Wasmtime's configuration options. Make sure to
complete this check list:

  • If you added a new Config method, you wrote extensive documentation for
    it.

    Our documentation should be of the following form:

    Short, simple summary sentence.
    
    More details. These details can be multiple paragraphs. There should be
    information about not just the method, but its parameters and results as
    well.
    
    Is this method fallible? If so, when can it return an error?
    
    Can this method panic? If so, when does it panic?
    
    # Example
    
    Optional example here.
    
  • If you added a new Config method, or modified an existing one, you
    ensured that this configuration is exercised by the fuzz targets.

    For example, if you expose a new strategy for allocating the next instance
    slot inside the pooling allocator, you should ensure that at least one of our
    fuzz targets exercises that new strategy.

    Often, all that is required of you is to ensure that there is a knob for this
    configuration option in wasmtime_fuzzing::Config (or one
    of its nested structs).

    Rarely, this may require authoring a new fuzz target to specifically test this
    configuration. See our docs on fuzzing for more details.

  • If you are enabling a configuration option by default, make sure that it
    has been fuzzed for at least two weeks before turning it on by default.


To modify this label's message, edit the .github/label-messager/wasmtime-config.md file.

To add new label messages or remove existing label messages, edit the
.github/label-messager.json configuration file.

Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator fuzzing Issues related to our fuzzing infrastructure wasmtime:api Related to the API of the `wasmtime` crate itself wasmtime:c-api Issues pertaining to the C API. wasmtime:config Issues related to the configuration of Wasmtime wasmtime:ref-types Issues related to reference types and GC in Wasmtime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants