Show weirdness of Windows on CI #9688

alexcrichton · 2024-11-26T21:56:16Z

Work-in-progress while I get to this to the point where I can share with folks. Not intended for merge

alexcrichton · 2024-11-26T22:57:23Z

@jsturtevant if you don't mind and if you're willing we could use some more Windows-specific advice/help on this. For background on this I was working on #9675 recently and was having a tough time landing it as it was consistently failing on CI. The failure was only on MinGW and not on other Windows targets (e.g. MSVC) or other platforms (e.g. macOS or Linux). The actual PR itself turned out to be largely unrelated to the crashes on MinGW since the reproduction was quite simple.

I've setup the PR here to showcase what's going on. Specifically we have a bit of C code in a helpers.c file around calling setjmp. We can't call setjmp from Rust as the Rust compiler has no concept of a function that returns twice (e.g. setjmp), so we have to rely on C for this. The current version I linked here is the one with the workaround I landed in #9675. Basically we want to write a function that looks like:

  platform_jmp_buf buf;
  if (platform_setjmp(buf) != 0) {
    return false;
  }
  *buf_storage = &buf;
  return body(payload, callee);

but this crashes on MinGW.

I've arranged a number of CI jobs here to showcase the various results of what's happening. Each job is run in release/debug mode to show the effect of a "fix" with/without optimizations. Each CI job is running cargo run test.wat which is executing the test.wat file in this PR in the CLI, basically just instantiating it. The test.wat file looks like:

(module (func unreachable) (start 0))

so a small wasm module that traps on instantiation. The longjmp in our trap handler (as unreachable compiles to ud2 so we execute a vectored exception handling on Windows) is what's crashing.

The results of CI look like:

build mode	`-DGOOD`	`-DBAD -DBAD_V1`	`-DBAD -DBAD_V2`
debug	✅	❌	✅
release	✅	❌	❌

For a bit easier to digest this is what we have for diffs:

`-DBAD_V1`

diff --git a/crates/wasmtime/src/runtime/vm/helpers.c b/crates/wasmtime/src/runtime/vm/helpers.c
index 84711450a..58ffea805 100644
--- a/crates/wasmtime/src/runtime/vm/helpers.c
+++ b/crates/wasmtime/src/runtime/vm/helpers.c
@@ -110,16 +110,16 @@ static bool wasmtime_setjmp_inverted(void **buf_storage,
                                      void *payload, void *callee) {
   platform_jmp_buf buf;
   if (platform_setjmp(buf) != 0) {
-    return true;
+    return false;
   }
   *buf_storage = &buf;
-  return !body(payload, callee);
+  return body(payload, callee);
 }

 bool VERSIONED_SYMBOL(wasmtime_setjmp)(void **buf_storage,
                                        bool (*body)(void *, void *),
                                        void *payload, void *callee) {
-  return !wasmtime_setjmp_inverted(buf_storage, body, payload, callee);
+  return wasmtime_setjmp_inverted(buf_storage, body, payload, callee);
 }

 void VERSIONED_SYMBOL(wasmtime_longjmp)(void *JmpBuf) {

`-DBAD_V2`

diff --git a/crates/wasmtime/src/runtime/vm/helpers.c b/crates/wasmtime/src/runtime/vm/helpers.c
index 84711450a..f40745099 100644
--- a/crates/wasmtime/src/runtime/vm/helpers.c
+++ b/crates/wasmtime/src/runtime/vm/helpers.c
@@ -110,16 +110,17 @@ static bool wasmtime_setjmp_inverted(void **buf_storage,
                                      void *payload, void *callee) {
   platform_jmp_buf buf;
   if (platform_setjmp(buf) != 0) {
-    return true;
+    return false;
   }
   *buf_storage = &buf;
-  return !body(payload, callee);
+  bool ret = body(payload, callee);
+  return ret;
 }

 bool VERSIONED_SYMBOL(wasmtime_setjmp)(void **buf_storage,
                                        bool (*body)(void *, void *),
                                        void *payload, void *callee) {
-  return !wasmtime_setjmp_inverted(buf_storage, body, payload, callee);
+  return wasmtime_setjmp_inverted(buf_storage, body, payload, callee);
 }

 void VERSIONED_SYMBOL(wasmtime_longjmp)(void *JmpBuf) {

the really weird thing is that -DBAD_V2, the only real change is "put the return value in a local variable", causes the test to be "fixed". In release mode this breaks down though.

So overall, is this something you'd be able to help us out debugging? We have a workaround for now so it's not too urgent, but the workaround is pretty wonky and feels quite brittle, so it seems like it'd be best to get a better understanding of it.

jsturtevant · 2024-12-02T17:22:35Z

I will take look at it this week

jsturtevant · 2024-12-07T00:33:08Z

Got some time to look into this today. I am able to reproduce it locally. This is a bit out of my expertise zone but doing some research I found a couple possible issues:

I have an env set up now (I hadn't used rust MinGW target before) and will spend some more time next week doing some comparisons in gdb and also see if I can get some help.

alexcrichton · 2024-12-07T01:10:40Z

Thanks for taking a look! And to be clear this isn't urgent in the sense that the current workaround seems to at least not cause CI to fall over and MinGW isn't a critical platform of ours (e.g. tier 1).

syntactically · 2025-01-06T15:39:10Z

I took a look at this (looped in by @jsturtevant). It looks like this is ultimately a codegen bug in mingw around the Windows x64 unwind ABI. Even though there isn't any explicit unwinding going on in wasmtime, mingw64, like msvc, defaults to implementing setjmp/longjmp using stack unwinding, in order to respect SEH and C++ destructors. It looks like it's possible to turn this off with -D__USE_MINGW_SETJMP_NON_SEH, but this is also worth fixing upstream, so I'll send up a patch to MinGW sometime soon.

In the meantime, your workaround is probably decently robust. The actual bug here is that, on x64, if you have a call instruction immediately prior to something that looks like a function epilogue, then the unwinder will, while unwinding that frame, see that the return address points into the epilogue and execute the epilogue without checking whether there was an exception handler (including one due to a setjmp) on that frame. MSVC and Clang fix this by ensuring that there is a nop after any call that might throw that is placed just before the function epilogue; this works because the unwinder doesn't consider nop to be a valid part of an epilogue. This is, as far as I can tell, technically a documented feature of the x64 unwind ABI, so not doing that is a bug in mingw. Your workaround works fairly reliably because it inserts an extra instruction for the negation between the call to the function that unwinds the stack via longjmp and the function epilogue; marking the function noinline or non-static (allowing callers in a different translation unit) to make sure the optimizer can't decide to elide the negation might add a bit of robustnes, though.

I haven't checked if Cranelift also has the problematic codegen pattern, but will take a quick look relatively shortly.

With respect to bytecodealliance#9688

With respect to bytecodealliance#9688 prtest:full

alexcrichton · 2025-01-06T16:02:36Z

Oh dear that's quite the bug, thank you so much for helping us diagnose this @syntactically! I'm testing out -D__USE_MINGW_SETJMP_NON_SEH in #9929 to see how that works because that feels more robust than an unusual code ordering to me.

I haven't checked if Cranelift also has the problematic codegen pattern, but will take a quick look relatively shortly.

Ah I can probably answer this one pretty easily which is to say that Cranelift is "trivially not affected" as Cranelift doesn't have support for catching exceptions at this time. We emit unwind information necessary to function (e.g. if something unwinds through Cranelift code) but at this time there's no catches via the system unwinder in Cranelift-generated code.

syntactically · 2025-01-06T16:38:11Z

Oh dear that's quite the bug, thank you so much for helping us diagnose this @syntactically! I'm testing out -D__USE_MINGW_SETJMP_NON_SEH in #9929 to see how that works because that feels more robust than an unusual code ordering to me.

Using that as a workaround if it does work probably makes sense! Once upstream mingw64 is fixed, we might want to revisit it though.

Ah I can probably answer this one pretty easily which is to say that Cranelift is "trivially not affected" as Cranelift doesn't have support for catching exceptions at this time. We emit unwind information necessary to function (e.g. if something unwinds through Cranelift code) but at this time there's no catches via the system unwinder in Cranelift-generated code.

Ah, I saw the unwind tables for the system unwinder were being generated and was assuming that precise unwinding with them was desirable. I think the consequences of this codegen are minimal if there aren't any handlers in the functions so compiled, but it will probably mess up system-unwinder stack traces and so forth, so I wonder if it might be worth dropping in the extra nop on windows anyway?

alexcrichton · 2025-01-06T18:17:02Z

Oh if it affects stack traces as well then yeah it might be a bug in Cranelift that we'll need to insert an extra nop as those are ideally correct to work best with debuggers

alexcrichton · 2025-01-06T18:18:11Z

In the meantime though I've flagged #9929 for merge which should resolve the main issue here (or at least gets code to a more-understandable state). Thanks again @syntactically and @jsturtevant for helping out here!

@syntactically if you find an issue with Cranelift feel free to open an issue to track the insertion of the nops!

* Test out `-D__USE_MINGW_SETJMP_NON_SEH` With respect to #9688 prtest:full * Add a comment for the workaround

Show the problem in CI

fdeee29

alexcrichton force-pushed the show-windows-weirdness branch from 58b1a33 to fdeee29 Compare November 26, 2024 22:46

Add test file

e6bb144

github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Nov 27, 2024

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Jan 6, 2025

Test out -D__USE_MINGW_SETJMP_NON_SEH

c581d3d

With respect to bytecodealliance#9688

alexcrichton mentioned this pull request Jan 6, 2025

Test out -D__USE_MINGW_SETJMP_NON_SEH #9929

Merged

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Jan 6, 2025

Test out -D__USE_MINGW_SETJMP_NON_SEH

c988c09

With respect to bytecodealliance#9688 prtest:full

alexcrichton closed this Jan 6, 2025

github-merge-queue bot pushed a commit that referenced this pull request Jan 6, 2025

Test out -D__USE_MINGW_SETJMP_NON_SEH (#9929)

a252039

* Test out `-D__USE_MINGW_SETJMP_NON_SEH` With respect to #9688 prtest:full * Add a comment for the workaround

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Show weirdness of Windows on CI #9688

Show weirdness of Windows on CI #9688

alexcrichton commented Nov 26, 2024

alexcrichton commented Nov 26, 2024

jsturtevant commented Dec 2, 2024

jsturtevant commented Dec 7, 2024

alexcrichton commented Dec 7, 2024

syntactically commented Jan 6, 2025

alexcrichton commented Jan 6, 2025

syntactically commented Jan 6, 2025

alexcrichton commented Jan 6, 2025

alexcrichton commented Jan 6, 2025

Show weirdness of Windows on CI #9688

Show weirdness of Windows on CI #9688

Conversation

alexcrichton commented Nov 26, 2024

alexcrichton commented Nov 26, 2024

-DBAD_V1

-DBAD_V2

jsturtevant commented Dec 2, 2024

jsturtevant commented Dec 7, 2024

alexcrichton commented Dec 7, 2024

syntactically commented Jan 6, 2025

alexcrichton commented Jan 6, 2025

syntactically commented Jan 6, 2025

alexcrichton commented Jan 6, 2025

alexcrichton commented Jan 6, 2025

`-DBAD_V1`

`-DBAD_V2`