Proposal: allow block-execution WASM panics to be bubbled up as a node-level fatal signal #11101
- [x] Is there an existing issue?
- [x] Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
While work on enabling nodes to "react more intelligently" to WASM-execution panics is underway in #10741, could we also get a way to pipe "observed global precondition validity" information back from block execution, so that the framework can signal the node runtime that a particular block execution just resulted in a fatal error, i.e. that the node should now be shut down gracefully? (A non-graceful process panic triggered by the framework would be fine too, provided state storage is designed to recover from such a panic efficiently.)
It seems that, for various reasons (currently a bad implementation of execution-state caching, I believe, though other things could trigger the same problem in the future), a node can get stuck in an invalid state without being aware of it: it believes all of its preconditions hold when they don't, so when it goes to execute a block it hits a WASM panic about "unreachable instruction executed", fails block processing, is told by the network that the given block is still the consensus block, tries to execute it again, panics again, fails again, and so on.
Example log message when this is happening (prints over and over):
In such cases, restarting the node resolves the problem (i.e. the WASM that previously failed to execute now executes successfully), since the node's durable state has not been corrupted, only the in-memory state of the previous run (and, conveniently, shutting down the node doesn't persist any of that corruption to disk).
I know there are other workarounds for this specific problem (disabling the execution-state cache entirely for now, I believe), but (1) those workarounds make execution a lot slower, potentially to the point of being unable to stay in sync; and (2) in the future, other problems could corrupt in-memory state in the same way.
WASM execution is "where the rubber meets the road" in terms of preconditions having to hold. So it would be nice to be able to tell a node running in a less-than-100%-verified configuration to treat WASM panics as a signal that the node has become corrupted and needs to be restarted. Perhaps not on the first WASM panic for a given block; but if the node tries to execute the same block, say, 10 times and fails every time with the same WASM panic, that would be a good time for the node itself to throw an "I'm wedged" fatal error.
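To make the shape of the proposal concrete, here is a minimal sketch of the escalation logic I have in mind. Nothing below exists in Substrate today: `WasmPanicTracker`, `PanicVerdict`, and the threshold of 10 are hypothetical names and values, and a real implementation would live wherever block-import errors are already handled, with the `Fatal` verdict wired into the node's graceful-shutdown path.

```rust
use std::collections::HashMap;

/// Hypothetical outcome of recording a failed block execution.
#[derive(Debug, PartialEq)]
enum PanicVerdict {
    /// Retry the block; the failure may be transient.
    Retry,
    /// The same block has panicked too many times: the node is presumed
    /// wedged and should be shut down (gracefully) and restarted.
    Fatal,
}

/// Tracks repeated WASM panics per block hash and escalates past a threshold.
struct WasmPanicTracker {
    max_attempts: u32,
    failures: HashMap<[u8; 32], u32>, // block hash -> consecutive panic count
}

impl WasmPanicTracker {
    fn new(max_attempts: u32) -> Self {
        Self { max_attempts, failures: HashMap::new() }
    }

    /// Record a WASM panic while executing `block_hash` and decide whether
    /// to keep retrying or to declare the node wedged.
    fn on_wasm_panic(&mut self, block_hash: [u8; 32]) -> PanicVerdict {
        let count = self.failures.entry(block_hash).or_insert(0);
        *count += 1;
        if *count >= self.max_attempts {
            PanicVerdict::Fatal
        } else {
            PanicVerdict::Retry
        }
    }

    /// A successful execution clears the counter for that block.
    #[allow(dead_code)]
    fn on_success(&mut self, block_hash: [u8; 32]) {
        self.failures.remove(&block_hash);
    }
}

fn main() {
    let mut tracker = WasmPanicTracker::new(10);
    let block = [0u8; 32];
    for attempt in 1..=10 {
        let verdict = tracker.on_wasm_panic(block);
        println!("attempt {attempt}: {verdict:?}");
        if verdict == PanicVerdict::Fatal {
            // In a real node this would request a graceful shutdown rather
            // than aborting the process.
            eprintln!("node appears wedged; requesting shutdown");
            break;
        }
    }
}
```

The hard part, and the actual feature request, is the plumbing: giving block execution a channel through which a `Fatal`-style verdict can reach the service's shutdown handle.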
Steps to reproduce
Run any number of Polkadot/Kusama parachains in their current default configurations. I've seen this happen on all Moonbeam chains, and on Astar Shiden.
Alternatives to proposal
Specialized external watchdog
It would be possible to detect WASM-execution-panic log messages out-of-band with some sort of watchdog process, which, on seeing such a message, would ask the init system to restart the node service.
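A rough sketch of what that could look like, purely for comparison; the service name, matched log substring, and threshold below are assumptions about a particular deployment, not anything that exists today.

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

// Hypothetical deployment values.
const SERVICE: &str = "parachain-node.service";
const PANIC_MARKER: &str = "unreachable instruction executed";
const MAX_HITS: u32 = 10;

fn main() -> std::io::Result<()> {
    // Follow the node's logs via the init system's journal.
    let mut child = Command::new("journalctl")
        .args(["-fu", SERVICE, "--output=cat"])
        .stdout(Stdio::piped())
        .spawn()?;

    let reader = BufReader::new(child.stdout.take().expect("piped stdout"));
    let mut hits = 0u32;

    for line in reader.lines() {
        if line?.contains(PANIC_MARKER) {
            hits += 1;
            eprintln!("watchdog: saw WASM panic ({} of {})", hits, MAX_HITS);
            if hits >= MAX_HITS {
                // Ask the init system to restart the node, then reset the count.
                Command::new("systemctl").args(["restart", SERVICE]).status()?;
                hits = 0;
            }
        }
    }
    Ok(())
}
```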
Advantages over an internal implementation: none that I can think of. It's not any simpler.
Disadvantages:
Generalized external watchdog
It would also be possible to have an external watchdog process that polls a node's L2 sync state over JSON-RPC and restarts the node if its sync stops for any reason.
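A corresponding sketch for this generalized case, assuming the node exposes the standard `system_syncState` JSON-RPC method over HTTP and runs under systemd; the RPC URL, service name, poll interval, and stall threshold are placeholders. (Uses the `reqwest` crate with its "blocking" and "json" features, plus `serde_json`.)

```rust
use serde_json::{json, Value};
use std::{process::Command, thread, time::Duration};

// Hypothetical deployment values.
const RPC_URL: &str = "http://127.0.0.1:9933";
const SERVICE: &str = "parachain-node.service";
const POLL_INTERVAL: Duration = Duration::from_secs(60);
const MAX_STALLED_POLLS: u32 = 5;

/// Fetch the node's current best block height via `system_syncState`.
fn current_block(client: &reqwest::blocking::Client) -> Option<u64> {
    let body = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "system_syncState",
        "params": []
    });
    let resp: Value = client.post(RPC_URL).json(&body).send().ok()?.json().ok()?;
    resp["result"]["currentBlock"].as_u64()
}

fn main() {
    let client = reqwest::blocking::Client::new();
    let mut last_seen = 0u64;
    let mut stalled = 0u32;

    loop {
        match current_block(&client) {
            Some(height) if height > last_seen => {
                last_seen = height;
                stalled = 0;
            }
            // No progress, or the RPC call itself failed.
            _ => stalled += 1,
        }
        if stalled >= MAX_STALLED_POLLS {
            eprintln!("watchdog: sync stalled at block {}, restarting {}", last_seen, SERVICE);
            let _ = Command::new("systemctl").args(["restart", SERVICE]).status();
            stalled = 0;
        }
        thread::sleep(POLL_INTERVAL);
    }
}
```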
Advantages:
Disadvantages: