Unresponsive Subsystems #3242
According to logs and metrics, the channel the overseer sends messages and signals to was empty or close to empty (although there is a chance this is wrong, because metrics are only updated once per minute?). This could potentially indicate a bug in
The only problem is that this bug is hard to reproduce, although it might be helpful to confirm if #3037 helps with that. |
+1 Also encountered this. |
I also encountered this while being an active validator on Kusama. The node didn't crash; it just hung there. Metrics are normal, and another validator in the same conditions and environment did not have the issue. In addition to fixing the bug itself, it would be good to ensure that the process terminates when this happens.
|
This is new; all of the reports so far say that the node shuts down when that happens:
|
Probably a Substrate-service / substrate-executor bug because it did log 'Essential task failed' |
+1 Also happened to me on the Kusama node while being in the active set |
Blockdaemon is also experiencing this issue with a newly synced Moonriver node.
|
I faced this issue on a new node spun up with Docker. The node disappeared from telemetry as soon as the chain caught up to the current block.
The node had unresponsive subsystems on 2 consecutive runs of this command, and on the 3rd run it succeeded and is visible on telemetry. First time becoming unresponsive:
2nd time:
|
I'm regularly encountering this where the chain will just stop syncing. The annoying part is that it doesn't crash gracefully, so it needs to be manually restarted in my case (or you could have a separate process monitoring that restarts it for you). As an aside, maybe it would be a good idea to have a fallback thread built into the client that automatically restarts it in the event that it is not syncing the chain after a given time, i.e. it has hit a non-recoverable dead-end bug in the code that can only be resolved by a restart. |
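One way to approximate that suggestion is a watchdog that simply exits the process when block import stalls, so a supervisor (systemd, pm2, etc.) restarts it. Below is a minimal Rust sketch of the idea, not code from the Polkadot client: `note_block_imported` is a hypothetical hook that a node's import pipeline would have to call, and the intervals are arbitrary.

```rust
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

/// Hypothetical hook: a node's block-import pipeline would call this on every imported block.
fn note_block_imported(last_import: &AtomicU64) {
    last_import.store(now_secs(), Ordering::Relaxed);
}

/// Abort the process if no block has been imported for `stall_timeout`,
/// relying on an external supervisor (systemd, pm2, ...) to restart it.
fn spawn_sync_watchdog(last_import: Arc<AtomicU64>, stall_timeout: Duration) {
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(30));
        let idle = now_secs().saturating_sub(last_import.load(Ordering::Relaxed));
        if idle > stall_timeout.as_secs() {
            eprintln!("no block imported for {idle}s; exiting so the supervisor can restart the node");
            process::exit(1);
        }
    });
}

fn main() {
    let last_import = Arc::new(AtomicU64::new(now_secs()));
    spawn_sync_watchdog(last_import.clone(), Duration::from_secs(10 * 60));

    // In a real node the sync/import loop would run here and call
    // `note_block_imported` on each imported block; this just demonstrates the call.
    note_block_imported(&last_import);
}
```

Exiting rather than restarting in-process keeps the watchdog simple and leans on the supervisor for the actual restart.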
I think I have a consistent way to replicate this: restore a Kusama snapshot from polkashots.io, then start the polkadot daemon for the first time. It always goes into "unresponsive subsystem" state without terminating. Then it never crashes again after restart.
|
I'm getting this only on a remote Ubuntu instance (droplet). People are mentioning restarting the node/collator. Is there a special command for restarting, or do they just mean to run it again?
|
That's not enough. See the validator specs: https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#requirements; something similar is needed for a collator as well, depending on your parachain.
You're not supposed to run the binary directly, but through some kind of daemon process manager that will restart it for you when it exits unexpectedly, e.g. https://pm2.keymetrics.io/ or a systemd service (see https://wiki.debian.org/systemd/Services). The issue in the general case hasn't been resolved yet, but in your case it's likely caused by a CPU that isn't powerful enough. |
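For reference, a minimal systemd unit along those lines; the paths, user, and flags are placeholders, not taken from this thread:

```ini
# /etc/systemd/system/polkadot.service
[Unit]
Description=Polkadot node
After=network-online.target

[Service]
User=polkadot
# Placeholder binary path and flags; adjust for your setup.
ExecStart=/usr/local/bin/polkadot --validator --name my-node
# Restart automatically whenever the process exits unexpectedly.
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Note that Restart=on-failure only covers the case where the process actually exits; a node that hangs without exiting, as described earlier in this issue, still needs a separate liveness check.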
I had a stalled node that exited after a while running as a collator on the Moonriver network. My machine fulfills the specs (i7-9700K @ 4900 MHz). An immediate restart allowed the node to operate, but it quickly failed again with the same error. After restarting again it has now run without failure for 15 hours. I'm running the collator in a Docker container. The node failed in two subsequent runs, also after removing it from the active collator set. The first failing run coincided with IO-wait warnings of up to 70%; the failure after the restart didn't show any signs of higher than usual IO activity. Memory consumption under normal operation is at 3 GB while swap is disabled. |
This seems to be stale & no longer an issue |
Approval voting and statement distribution are occasionally unresponsive. This can only occur when they haven't processed signals at a fast enough rate or haven't processed incoming external messages (from outside of the overseer) at a high enough rate.
It may be that backpressure on various subsystems, in particular networking, is too high. However, the `SubsystemContext::recv` implementation is biased to return signals before messages, so in the absence of deadlock the subsystem should eventually make enough progress to poll and receive the next signal. The approval voting subsystem does receive external messages from the GRANDPA voter, and those messages are less likely to make progress than signals are.
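For illustration, here is a rough Rust sketch of that signal-before-message bias using `futures::select_biased` over two channels. All names here (`Signal`, `Message`, `FromOverseer`, `recv_biased`) are illustrative stand-ins, not the actual overseer types or the real `SubsystemContext::recv` implementation:

```rust
use futures::{channel::mpsc, executor::block_on, select_biased, StreamExt};

enum Signal { BlockFinalized, Conclude }
enum Message { External(String) }
enum FromOverseer { Signal(Signal), Communication(Message) }

/// Return the next item for a subsystem, preferring signals over messages so
/// that signal processing keeps up even under heavy message traffic.
async fn recv_biased(
    signals: &mut mpsc::Receiver<Signal>,
    messages: &mut mpsc::Receiver<Message>,
) -> Option<FromOverseer> {
    select_biased! {
        // Branches are polled in declaration order, so a ready signal always wins.
        sig = signals.next() => sig.map(FromOverseer::Signal),
        msg = messages.next() => msg.map(FromOverseer::Communication),
    }
}

fn main() {
    let (mut sig_tx, mut signals) = mpsc::channel(8);
    let (mut msg_tx, mut messages) = mpsc::channel(8);
    block_on(async {
        // Queue a message first, then a signal.
        msg_tx.try_send(Message::External("work".into())).unwrap();
        sig_tx.try_send(Signal::BlockFinalized).unwrap();
        // The signal is returned first even though it arrived second.
        let first = recv_biased(&mut signals, &mut messages).await;
        assert!(matches!(first, Some(FromOverseer::Signal(_))));
    });
}
```

In this sketch the bias only helps if the receiving task is polled at all; a subsystem stuck on some other await point would still appear unresponsive, which matches the deadlock caveat above.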