Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Proposal: allow block-execution WASM panics to be bubbled up as a node-level fatal signal #11101

Closed
2 tasks done
tsutsu opened this issue Mar 23, 2022 · 1 comment
Closed
2 tasks done

Comments

@tsutsu
Copy link

tsutsu commented Mar 23, 2022

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

While work on enabling nodes to "react more intelligently" to WASM-execution panics is being done here (#10741), could we also get the ability to pipe "observed global precondition validity" information back from block execution, such that the framework could signal the node runtime that a particular block execution just resulted in a fatal error — i.e. that the node should now be gracefully shut down? (A non-graceful process-runtime panic triggered by the framework would be fine too, if state storage is designed to efficiently recover from such.)

It seems like, for various reasons (currently a bad implementation of execution-state caching, I believe, but other things could also trigger the same problem in the future), a node can get stuck in an invalid state without being aware of it — where it thinks all its preconditions hold, but they don't, such that when it goes to execute a block, it hits a WASM panic about "unreachable instruction executed", fails block processing, gets told by the network that the given block is still the consensus block, tries to execute it again, panics, fails, etc.

Example log message when this is happening (prints over and over):

2022-03-23 15:26:08 [Parachain] 💔 Error importing block 0x7934356df54e18d55d7f1854d5d934dad90e92c81ef11489e284bf0be23570a6: consensus error: Import failed: Error at calling runtime api: Execution failed: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed
WASM backtrace:
    0: 0x23171e - <unknown>!rust_begin_unwind
    1: 0x3991 - <unknown>!core::panicking::panic_fmt::h6314b5c91abe7349
    2: 0x161058 - <unknown>!core::panicking::panic_display::hee1e647eee41c608
    3: 0x16021a - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::execute_block::h355b9ea08341c5ad
    4: 0x22368d - <unknown>!Core_execute_block

In such cases, restarting the node resolves the problem (i.e. the WASM that previously failed to execute now executes successfully), as the node's durable state has not been corrupted, only the in-memory state of the previous run (and, conveniently, shutting down the node doesn't persist any of the corruption to disk.)

I know there are other workarounds for this specific problem (disabling the execution state cache entirely for now, I believe) but 1. these workarounds make execution a lot slower, potentially to the point of being unable to stay in sync; and 2. in the future, other problems could also cause in-memory state corruption in the same way.

WASM execution is "where the rubber meets the road" in terms of preconditions having to hold. So it would be nice if we could tell a node that's running in a less-than-100%-verified configuration, to consider WASM panics as a signal that the node has become corrupted and needs to be restarted. Perhaps not the first WASM panic for any given block; but perhaps if the node tries to execute the same block 10 times, and fails every time with the same WASM panic, that'd be a good time for the node itself to throw an "I'm wedged" fatal.

Steps to reproduce

Run any number of Polkadot/Kusama parachains in their current default configurations. I've seen this happen on all Moonbeam chains, and on Astar Shiden.

Alternatives to proposal

Specialized external watchdog

It would be possible to detect WASM-execution-panic log messages out-of-band in some sort of watchdog process, with the detection of said message triggering that watchdog process to ask the init system to restart the node service.

Advantages over an internal implementation: none that I can think of. It's not any simpler.

Disadvantages:

  • An external watchdog would have a much higher overhead than a node-internal WASM panic counter. Logs would have to be streamed through + parsed somewhere (whether in the watchdog, or within the OS logging manager, it's still expensive.)
  • The watchdog would have to be kept up-to-date in the face of changes to the log message the node might emit.

Generalized external watchdog

It would also be possible to just have an external watchdog process that uses JSON-RPC to poll a node's L2 sync state, and restart the node if its sync stops for any reason.

Advantages:

  • Only have to build this once; works for many blockchains.

Disadvantages:

  • It's a "dumb" check — it won't respect intentional pauses in sync, during e.g. post-upgrade database version migrations.
  • In a stable-equilibrium configuration (i.e. one that doesn't cause flapping), it would likely take far longer to notice a sync failure than directly observing failure signals would.
@tsutsu tsutsu changed the title Proposal: configuration option to treat block-execution WASM panics as a node-level fatal Proposal: allow block-execution WASM panics to be bubbled up as a node-level fatal signal Mar 23, 2022
@bkchr
Copy link
Member

bkchr commented Mar 23, 2022

This is a duplicate of: #10585

Which is already implemented :) Just currently hidden behind a feature.

@bkchr bkchr closed this as completed Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants