Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call stack exhaustion found via CPython's test_marshal test suite around October 19th/20th #5152

Closed
brettcannon opened this issue Oct 28, 2022 · 10 comments
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@brettcannon
Copy link
Contributor

Thanks for filing a bug report! Please fill out the TODOs below.

Note: if you want to report a security issue, please read our security policy!

Test Case

test_marshal from Python's test suite.

Steps to Reproduce

Easiest to just look at a GitHub Action run, e.g. https://github.com/tiran/cpython-wasm-test/actions/runs/3342891022/jobs/5535571299 .

Expected Results

The test to at least complete.

Actual Results

Call stack exhaustion which doesn't occur under wasmtime 1.0.1.

https://github.com/tiran/cpython-wasm-test/actions/runs/3294946299/jobs/5432988265#step:7:408

Versions and Environment

Wasmtime version or commit: latest

Operating system: Linux

Architecture: x64

Extra Info

This started occurring on October 22: https://github.com/tiran/cpython-wasm-test/actions/workflows/wasi.yml . As mentioned above, wasmtime 1.0.1 passes without issue and everything was fine until something changed between October 19th @ 22:04 and October 20th @ 22:13 for what https://wasmtime.dev/install.sh installs.

@brettcannon brettcannon added the bug Incorrect behavior in the current implementation that needs fixing label Oct 28, 2022
@brettcannon
Copy link
Contributor Author

/cc @tiran

@cfallin
Copy link
Member

cfallin commented Oct 28, 2022

A few bits of info that would help, if you wouldn't mind investigating further on your end:

  • What is this test case doing? Is it supposed to recurse deeply, but "not too deeply"? In other words, is the failure because we don't support a depth of recursion that we previously did, or because something else went wrong (correctness bug etc) that is causing infinite recursion now?
  • If it's a quantitative (how much recursion) rather than qualitative (recursion at all) bug, could you try to "bisect the limit" a bit and play with the test case's recursion depth, to see how much worse it got? E.g., did the test slip in just under a stack limit before, and it is just over now? Or did we explode stack usage per frame by a large factor?

Nothing major changed in regalloc or codegen between 1.0 and 2.0 that would lead to a large expansion of stackframe size, at least off the top of my head, but if the above investigation shows that we somehow have a much lower limit, we can of course dig in and see what's going on.

Thanks!

@jameysharp
Copy link
Contributor

I haven't tried reproducing this locally, but I was curious and looked through the failing test's source code a bit. I'm guessing the specific failing test is test_recursion_limit.

That test expects to be able to recurse over a data structure nested about 2,000 levels deep, and expects marshal.c's w_object function to raise ValueError if the data structure is any deeper. Since that function is mutually recursive with w_complex_object, this corresponds to a C stack depth of about 4,000 frames, plus a few for setting up the test. And the stack trace recorded in CI does show these two functions alternating many times, so I think this is the correct proximate cause.

So I guess that makes this a "quantitative" recursion bug.

@alexcrichton
Copy link
Member

I tried to download the last successful build's copy of python.wasm, but it has unfortunately expired and I could not acquire it. Instead downloaded the 3.11 rc2 from here and with this file:

import marshal

head = last = []
MAX_MARSHAL_STACK_DEPTH = 2000
for i in range(MAX_MARSHAL_STACK_DEPTH - 2):
    last.append([0])
    last = last[-1]

data = marshal.dumps(head)
new_head = marshal.loads(data)

this succeeded:

$ cargo run --release -- run --dir . --disable-cache -- python.wasm ./foo.py

for all of main, v1.0.0, v1.0.1, v2.0.0, and v2.0.1 on an x86_64 machine.

On an aaarch64 machine v1.0.0 and main both stack overflow. The main branch stack is 3111 wasm frames deep and the v1.0.0 overflows at 3112 frames deep, so not all that different.

Basically I unfortunately am unable to easily reproduce, and at least on AArch64 I can see a small difference in stack usage but nothing meaningful.


I'd echo what @cfallin said and reiterate @brettcannon if you're able it would be very helpful to reduce this issue to ideally "here's a wasm file and a directory tree and a command that fails." We can probably take it from there but without that it may be unlikely much can be done here.

@alexcrichton
Copy link
Member

Ok with some further testing and hammering I was able to get something. At 29c7de7 it takes 497712 bytes of stack to complete this test. At the next commit, 9715d91, it takes 528752 bytes of stack. The default allowed stack is 524288, which is why this is stack overflowing.

The stack required increased by 6%. Of the two co-recursive functions, w_object's stack size did not increase from 0x50 bytes and w_complex_object's stack size increased from 0x80 to 0x90 bytes. This represents a 7.7% increase in stack size for these two functions combined.

Overall I think that should fully explain what's going on here. A small increase in stack from w_complex_object happened to hit the limit. Otherwise everything appears to be normal.

I'll leave @jameysharp to conclude whether #4966 should have increased the stack size here or not. I've extracted the single function to this wasm file: extract.wasm.gz where I get:

$ ./before compile extract.wasm && objdump -S extract.cwasm | rg 'sub.*rsp'
      1a:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
$ ./after compile extract.wasm && objdump -S extract.cwasm | rg 'sub.*rsp'
      1a:       48 81 ec 90 00 00 00    sub    $0x90,%rsp

locally on an x86_64 target.


@brettcannon to solve this issue for you in the meantime I believe there's a few options:

We could alternatively consider increasing the default stack size in a platform-specific manner. I believe the current limit is relatively restrictive due to our testing on Windows where the main thread stack was smaller by default than other platforms. Ideally though we could keep consistent platform behavior.

@cfallin
Copy link
Member

cfallin commented Oct 29, 2022

Thank you @alexcrichton for bisecting this!

I too am curious why exactly we use 16 extra bytes in the stackframe in this function. We can look into that a bit; however I think the high-order bit here, actually much more important, is to set expectations/norms around this sort of thing, so I'm going to go ahead and give some thoughts on that. (I don't mean to pick on this particular instance of the issue @brettcannon -- this is just a useful "norm-setting opportunity", so I'll make a general point.)

In general, I want to make sure we don't (accidentally or otherwise) fall into a "no regressions ever" expectation. The size of a stackframe (number of spillslots) is implementation-defined, and a Wasm compiler is free to be as efficient or as inefficient as it likes. A test (such as the one linked above) that depends on the stack size fitting just under some implementation-defined limit, given implementation-defined stackframe sizes, is a really brittle test: it could break with any minor version update. In other words, I want to be very clear here first that Wasmtime/Cranelift are perfectly standards-compliant here, and so in a strict sense, this is not a bug. I would consider a "blow up by factor of N" regression a performance bug, but this is not that: it's one extra spillslot.

Of course, all else being equal, regressions are bad. Over time the compiler should improve on average, and we would never want to take a change that adds an extra 16 bytes to a stackframe with no other benefit. But that qualification is key: most things have tradeoffs. (In the case here, 4-6% faster compile times, according to the commit message.) The impact of a strict no-regressions policy is that one ends up climbing the hill to a local maximum, and one can never really change the design otherwise: the core algorithms/approach are ossified forever (or a new algorithm/approach needs to go to absolutely heroic lengths to match or exceed the old approach in every single benchmark, and is always "on call" to be "fixed" when new cases are discovered).

Investigations of small N% regressions like this one can be useful because they can uncover real problems, or opportunities to fix a case. I'll also defer to @jameysharp whether he wants to dig into this one. I wouldn't be surprised however if it ends up being something like "values processed in slightly different order and spillslot-reuse heuristic now stops searching one step too soon" -- that sort of thing can easily happen in a complex compiler.

I'd second @alexcrichton 's suggestion to set either the recursion depth or stack limit to widen the gap between the two a bit -- in other words the most sustainable approach here is to not live right "on the edge" of the limit.

I don't want to discourage filing bugs for this sort of thing, as long as they have enough detail to be helpful; but I just want to make sure we're all on the same page re: acceptable outcomes. Happy to discuss this further if anyone would like!

@alexcrichton
Copy link
Member

I agree with everything you say, and I, too, do not wish to have a policy of zero regressions accepted. Personally I feel that this sequence of events for this issue is ok to play out, though. As an end-user you probably aren't that aware of how close to the limits you're running which means that if a release, like this one, bumps you over the limit it looks like a legitimate regression rather than a small amount more stack space. In that sense I very much want to still encourage these bugs being opened so we can investigate and figure out a conclusion.

On this particular but I personally see no issue in providing the --max-wasm-stack knob and considering it closed.

@jameysharp
Copy link
Contributor

I chatted about this offline with @cfallin and others.

I'm a little puzzled at how #4966 could have impacted stack frame size. In particular, Chris says regalloc2 does a value-merging pre-processing step that sounds to me like it should have made the input to register allocation be the same with or without the PR. Perhaps merging values earlier is enabling other Cranelift optimizations that lead to longer live-ranges?

It would certainly be nice to understand the root cause here. But I think @alexcrichton's --max-wasm-stack option is the best solution for making this particular CPython test reliable. And since it doesn't look like this is a sign of dramatically larger stack frames, I think it'll be okay if we don't get around to identifying the root cause.

So I'm closing this issue as I believe it's resolved by #5156. @brettcannon, if that solution isn't sufficient for you, feel free to re-open with more details about what you need. Thank you for your report!

@brettcannon
Copy link
Contributor Author

Thanks everyone for the details and looking into this! Sorry I couldn't react faster to the request for more info as I have been a bit busy with other things. 😅

As for the result, I'm totally happy with the answers. The issue was opened more to make sure we had not found a bug as CPython is somewhat notorious in doing in edge cases of platforms. 😁 I will lower the stack depth cap in CPython under WASI to a lower number (which is the key purpose of the test; to know when we need a lower stack depth so that we can make sure you don't overflow and we can raise an exception instead).

@brettcannon
Copy link
Contributor Author

I have opened python/cpython#98925 to track fixing this on the CPython side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

No branches or pull requests

4 participants