This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Unresponsive Subsystems #3242

Closed
rphmeier opened this issue Jun 14, 2021 · 14 comments
Labels
I3-bug Fails to follow expected behavior.

Comments

@rphmeier
Contributor

rphmeier commented Jun 14, 2021

Approval voting and statement distribution are occasionally unresponsive. This can only occur when they haven't processed signals at a fast enough rate or haven't processed incoming external messages (from outside of the overseer) at a high enough rate.

It may be that backpressure on various subsystems, in particular networking, is too high. However, the SubsystemContext::recv implementation is biased to return signals before messages, so in the absence of deadlock the subsystem should eventually make enough progress to poll and receive the next signal.

The approval voting subsystem also receives external messages from the GRANDPA voter, and those are less likely to make progress than signals are.
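
For reference, a minimal sketch of what a signal-biased receive can look like (illustrative only, not the actual SubsystemContext::recv; the enum variants and channel names are simplified stand-ins). futures::select_biased! polls its arms in declaration order, so a pending signal is always returned before a pending message:

use futures::{channel::mpsc, select_biased, StreamExt};

// Simplified stand-ins for the real overseer types.
enum Signal { ActiveLeaves, BlockFinalized, Conclude }
enum Message { Network(Vec<u8>) }
enum FromOverseer { Signal(Signal), Communication(Message) }

async fn recv(
    signals: &mut mpsc::Receiver<Signal>,
    messages: &mut mpsc::Receiver<Message>,
) -> Result<FromOverseer, &'static str> {
    // Arms are polled in order: a ready signal always wins over a ready
    // message, so messages are only returned when no signal is pending.
    select_biased! {
        signal = signals.next() => signal
            .map(FromOverseer::Signal)
            .ok_or("Signal channel is terminated and empty."),
        msg = messages.next() => msg
            .map(FromOverseer::Communication)
            .ok_or("Message channel is terminated and empty."),
    }
}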

@ordian
Member

ordian commented Jun 14, 2021

According to logs and metrics, the channel the overseer sends messages and signals over was empty or close to empty (although there is a chance this is wrong, because metrics are only updated once per minute?).

This could potentially indicate a bug in futures::mpsc. We could try using another implementation; there are plenty of options.

The only problem is that this bug is hard to reproduce, although it would be helpful to confirm whether #3037 helps with that.
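
On the metrics point: one way to sample channel fill levels more often than once per minute is to wrap the futures::mpsc endpoints with a shared atomic counter. This is a simplified, hypothetical sketch of that metered-channel idea, not the actual implementation:

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use futures::channel::mpsc;
use futures::{SinkExt, StreamExt};

// Wrap both endpoints so the number of in-flight messages can be read
// at any time, independently of the metrics update interval.
struct MeteredSender<T> { inner: mpsc::Sender<T>, len: Arc<AtomicUsize> }
struct MeteredReceiver<T> { inner: mpsc::Receiver<T>, len: Arc<AtomicUsize> }

fn metered_channel<T>(cap: usize) -> (MeteredSender<T>, MeteredReceiver<T>, Arc<AtomicUsize>) {
    let (tx, rx) = mpsc::channel(cap);
    let len = Arc::new(AtomicUsize::new(0));
    (
        MeteredSender { inner: tx, len: len.clone() },
        MeteredReceiver { inner: rx, len: len.clone() },
        len, // hand this to a metrics task and sample it as often as needed
    )
}

impl<T> MeteredSender<T> {
    async fn send(&mut self, item: T) -> Result<(), mpsc::SendError> {
        self.inner.send(item).await?;
        self.len.fetch_add(1, Ordering::Relaxed);
        Ok(())
    }
}

impl<T> MeteredReceiver<T> {
    async fn next(&mut self) -> Option<T> {
        let item = self.inner.next().await;
        if item.is_some() {
            self.len.fetch_sub(1, Ordering::Relaxed);
        }
        item
    }
}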

@carumusan

+1 Also encountered this.

@nicolasochem

I also encountered this while being an active validator on Kusama. The node didn't crash, it just hung there. Metrics were normal, and another validator in the same conditions and environment did not have the issue.

In addition to fixing the bug itself, it would be good to ensure that the process terminates when this happens.

2021-07-01 10:28:08 💤 Idle (10 peers), best: #8153021 (0x233c…c58b), finalized #8152992 (0xaefe…5429), ⬇ 54.2kiB/s ⬆ 18.1kiB/s
2021-07-01 10:28:09 ✨ Imported #8153022 (0xb51b…249f)
2021-07-01 10:28:10 ✨ Imported #8153023 (0x0c22…5783)
2021-07-01 10:28:10 Subsystem statement-distribution-subsystem appears unresponsive.
2021-07-01 10:28:10 Essential task `overseer` failed. Shutting down service.
2021-07-01 10:28:10 err=Subsystem(Context("Signal channel is terminated and empty."))
2021-07-01 10:28:10 subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Context("Signal channel is terminated and empty.") }
2021-07-01 10:28:10 subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Context("Signal channel is terminated and empty.") }
2021-07-01 10:28:10 error receiving message from subsystem context for job job="CandidateBackingJob" err=Context("Signal channel is terminated and empty.")
2021-07-01 10:28:10 subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Context("Signal channel is terminated and empty.") }
2021-07-01 10:28:10 error receiving message from subsystem context for job job="BitfieldSigningJob" err=Context("Signal channel is terminated and empty.")
2021-07-01 10:28:10 error receiving message from subsystem context for job job="ProvisioningJob" err=Context("Signal channel is terminated and empty.")
2021-07-01 10:28:10 subsystem exited with error subsystem="runtime-api-subsystem" err=Context("Signal channel is terminated and empty.")
2021-07-01 10:28:10 error receiving message from subsystem context: Context("Signal channel is terminated and empty.") err=Context("Signal channel is terminated and empty.")
2021-07-01 10:28:10 subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Context("Signal channel is terminated and empty.") }
2021-07-01 10:28:31 [#8152979] 🗳  Starting unsigned phase(true).
2021-07-01 10:28:51 [#8152979] 🗳  Starting unsigned phase(true).
2021-07-01 10:32:15 [#8152979] 🗳  Starting unsigned phase(true).

@ordian
Member

ordian commented Jul 1, 2021

it would be good to ensure that the process terminates when this happens.

This is new; all of the reports so far say that the node shuts down when this happens:

ERROR Essential task `overseer` failed. Shutting down service.
Error:                                                                     
   0: Other: Essential task failed.                                             
                                                                                
Location:                                  
   src/main.rs:25    

@rphmeier
Contributor Author

rphmeier commented Jul 2, 2021

Probably a Substrate-service / substrate-executor bug, because it did log 'Essential task failed'.

@pmensik
Contributor

pmensik commented Jul 9, 2021

+1 This also happened to me on a Kusama node while in the active set.

@kobrien

kobrien commented Jul 13, 2021

Blockdaemon is also experiencing this issue with a newly synced Moonriver node:

2021-07-13 17:39:39 [🌗] ✨ Imported #96374 (0x2301…f2fa)
2021-07-13 17:39:49 [Relaychain] Subsystem candidate-validation-subsystem appears unresponsive.
2021-07-13 17:39:49 [Relaychain] err=Subsystem(Context("Signal channel is terminated and empty."))
2021-07-13 17:39:49 [Relaychain] Essential task `overseer` failed. Shutting down service.
2021-07-13 17:39:49 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Context("Signal channel is terminated and empty.") }
2021-07-13 17:39:49 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Context("Signal channel is terminated and empty.") }
2021-07-13 17:39:50 Could not send spawned task handle to queue: send failed because receiver is gone
2021-07-13 17:39:50 Could not send spawned task handle to queue: send failed because receiver is gone
2021-07-13 17:39:50 Could not send spawned task handle to queue: send failed because receiver is gone
Error: Service(Other("Essential task failed."))

@higginator

I faced this issue on a new node spun up with docker. The node disappeared from telemetry as soon as the chain caught up to the current block.
This is the command I ran:

docker run -it -p 30333:30333 -p 9933:9933 -p 9944:9944 -v <path> --validator --chain kusama --name <name> --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' --telemetry-url 'wss://telemetry.polkadot.io/submit/ 0'

The node had unresponsive subsystems on 2 consecutive runs of this command; on the 3rd run it succeeded and is now visible on telemetry.

first time becoming unresponsive:

2021-07-15 05:23:13 Candidate included without being backed? candidate_hash=0x5ac213cf1c5d8a5256b09501ff5241d9defd235a994280df9939723362378598
2021-07-15 05:23:13 ✨ Imported #8350215 (0xce38…6d6f)
2021-07-15 05:23:13 Candidate included without being backed? candidate_hash=0xa46962833ad0425413e3b3c8a8ce48d2b4b23d6f7a042fcadcb977c3d43fe476
2021-07-15 05:23:13 💔 Error importing block 0x76e21512b4441f864bb42c9e30efca7ea5e19b5ac6afc4d427e5cd5579c9f0cc: Err(Other(ClientImport("Invalid operation in the pending changes tree: Tried to import or finalize node that is an ancestor of a previously finalized node")))
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0x73dfb5b0d349e4493ac01a8559650ee780d858d07a6e6feb30b3b23f30b97552
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0x1a3b342979609f56a01af8fb0df023e7c170ac7e71c4594a7d7607963c54d4ab
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0xe636bb5163f544e941dee5648ce65c00d0ae55b4e4facf10c2e438f11a30f536
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0xa8fc36097448eb0cbcf9fc54d194e939ec2db77f88c8f119b4a652d68ea6a761
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0x42d9303adab0840d3b0de1c9e135a6014444632f05c34a884fcf023807e27ff6
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0x3150736ef1683c8297aac0b2f8873c9e1f18f20a227f4c5a437af01e0b883b4d
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0x6fa740e05375055494f8e7be65ac1b26127f203a1596aab05956d255e5ad36bf
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0x079809075174dcf3721a5617f3698d0a06a4958eed93258756cf53e36456fd8c
2021-07-15 05:23:14 Candidate included without being backed? candidate_hash=0xc0225ef87ed98635c4e2841a7835d1e6d7a6dc28864a28be1a38eeeeb2cf03b2
2021-07-15 05:23:14 ⚙️  Syncing  0.5 bps, target=#8350279 (38 peers), best: #8350215 (0xce38…6d6f), finalized #8350215 (0xce38…6d6f), ⬇ 1.5MiB/s ⬆ 10.8kiB/s
2021-07-15 05:23:14 Subsystem availability-store-subsystem appears unresponsive. signal=BlockFinalized(0x8c664626ce3d036f41a957af7db6fbf5f2e99299983a755f5c6bb638268f8a4b, 8350017) received=675
2021-07-15 05:23:14 error receiving message from subsystem context for job job="CandidateBackingJob" err=Context("Signal channel is terminated and empty.")
2021-07-15 05:23:14 subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Context("Signal channel is terminated and empty.") }
2021-07-15 05:23:14 subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Context("Signal channel is terminated and empty.") }
2021-07-15 05:23:14 Essential task `overseer` failed. Shutting down service.
2021-07-15 05:23:14 error receiving message from subsystem context for job job="BitfieldSigningJob" err=Context("Signal channel is terminated and empty.")
2021-07-15 05:23:14 error receiving message from subsystem context: Context("Signal channel is terminated and empty.") err=Context("Signal channel is terminated and empty.")
2021-07-15 05:23:14 Shutting down Network Bridge due to error err=Context("Signal channel is terminated and empty.")
2021-07-15 05:23:14 error receiving message from subsystem context for job job="ProvisioningJob" err=Context("Signal channel is terminated and empty.")
2021-07-15 05:23:14 subsystem exited with error subsystem="network-bridge-subsystem" err=FromOrigin { origin: "network-bridge", source: Context("Received SubsystemError from overseer: Context(\"Signal channel is terminated and empty.\")") }
2021-07-15 05:23:14 subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Context("Signal channel is terminated and empty.")) }
2021-07-15 05:23:14 subsystem exited with error subsystem="runtime-api-subsystem" err=Context("Signal channel is terminated and empty.")
2021-07-15 05:23:14 subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Context("Signal channel is terminated and empty.") }

2nd time:

2021-07-15 06:00:15 Candidate included without being backed? candidate_hash=0x84253363968405a42601378a7d6ce2cfd367a0d62a615b1ef6eca0f884b7b644
2021-07-15 06:00:15 Candidate included without being backed? candidate_hash=0x1d3ab207d4c23b37393c9c17426575c85f60678179c8cade2c4ecc10a7ef7a2e
2021-07-15 06:00:15 Subsystem availability-store-subsystem appears unresponsive. signal=BlockFinalized(0xd565c5614a291d7181d660436a87a956ce59131c391900a345ad4cfad40225f1, 8350451) received=178
2021-07-15 06:00:15 subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Context("Signal channel is terminated and empty.") }
2021-07-15 06:00:15 error receiving message from subsystem context for job job="CandidateBackingJob" err=Context("Signal channel is terminated and empty.")
2021-07-15 06:00:15 subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Context("Signal channel is terminated and empty.") }
2021-07-15 06:00:15 Essential task `overseer` failed. Shutting down service.
2021-07-15 06:00:15 subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Context("Signal channel is terminated and empty.")) }
2021-07-15 06:00:15 error receiving message from subsystem context for job job="ProvisioningJob" err=Context("Signal channel is terminated and empty.")
2021-07-15 06:00:15 subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Context("Signal channel is terminated and empty.") }
2021-07-15 06:00:15 Shutting down Network Bridge due to error err=Context("Signal channel is terminated and empty.")
2021-07-15 06:00:15 error receiving message from subsystem context: Context("Signal channel is terminated and empty.") err=Context("Signal channel is terminated and empty.")
2021-07-15 06:00:15 subsystem exited with error subsystem="network-bridge-subsystem" err=FromOrigin { origin: "network-bridge", source: Context("Received SubsystemError from overseer: Context(\"Signal channel is terminated and empty.\")") }
2021-07-15 06:00:15 subsystem exited with error subsystem="runtime-api-subsystem" err=Context("Signal channel is terminated and empty.")
2021-07-15 06:00:15 error receiving message from subsystem context for job job="BitfieldSigningJob" err=Context("Signal channel is terminated and empty.")
2021-07-15 06:00:15 subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Context("Signal channel is terminated and empty.")) }
2021-07-15 06:00:15 💤 Idle (32 peers), best: #8350648 (0xee52…29e7), finalized #8350645 (0x40ff…f422), ⬇ 163.0kiB/s ⬆ 1.7MiB/s

@carumusan

I'm regularly encountering this, where the chain just stops syncing. The annoying part is that it doesn't exit cleanly, so in my case it needs to be restarted manually (or you could have a separate monitoring process that restarts it for you).

As an aside, maybe it would be a good idea to have a fallback thread built into the client that automatically restarts it when it has not synced the chain for a given time, i.e. when it has hit a non-recoverable dead-end bug that can only be resolved by a restart. A sketch of the idea follows.
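
A minimal sketch of that watchdog idea (entirely hypothetical, not an existing feature; note_block_imported would have to be wired into the block-import path, and the abort relies on an external supervisor such as systemd or Docker to restart the process):

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Highest imported block number, updated from the import pipeline.
static BEST_BLOCK: AtomicU64 = AtomicU64::new(0);

pub fn note_block_imported(number: u64) {
    BEST_BLOCK.fetch_max(number, Ordering::Relaxed);
}

// If the best block has not advanced within `stall_timeout`, abort the
// process so an external supervisor can restart the node.
pub fn spawn_watchdog(stall_timeout: Duration) {
    std::thread::spawn(move || {
        let mut last_seen = BEST_BLOCK.load(Ordering::Relaxed);
        loop {
            std::thread::sleep(stall_timeout);
            let now = BEST_BLOCK.load(Ordering::Relaxed);
            if now == last_seen {
                eprintln!("no new best block in {:?}; aborting", stall_timeout);
                std::process::abort();
            }
            last_seen = now;
        }
    });
}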

@nicolasochem

nicolasochem commented Jul 18, 2021 via email

@decentration

decentration commented Jul 27, 2021

I'm getting this only on a remote Ubuntu instance (droplet: 2 GB memory / 1 Intel vCPU / 50 GB disk / LON1 / Ubuntu 20.04 (LTS) x64). The collator works fine on a local machine, so perhaps it's a RAM issue? Or could it be a firewall issue?

People mention restarting the node/collator. Is there a special command for restarting, or do they just mean to run it again?

./target/release/parachain-collator --collator --parachain-id 2001 --force-authoring --port 30333 --ws-port 8845 --rpc-cors all --no-mdns --unsafe-rpc-external -- --execution wasm --chain rococo-chachacha.json --no-mdns --unsafe-rpc-external
2021-07-27 16:11:23 Parachain Collator Template    
2021-07-27 16:11:23 ✌️  version 3.0.0-6c977d5-x86_64-linux-gnu    
2021-07-27 16:11:23 ❤️  by Anonymous, 2017-2021    
2021-07-27 16:11:23 📋 Chain specification: Local Testnet    
2021-07-27 16:11:23 🏷 Node name: noisy-twig-7306    
2021-07-27 16:11:23 👤 Role: AUTHORITY    
2021-07-27 16:11:23 💾 Database: RocksDb at /root/.local/share/parachain-collator/chains/local_testnet/db    
2021-07-27 16:11:23 ⛓  Native runtime: template-parachain-1 (template-parachain-0.tx1.au1)    
2021-07-27 16:11:29 Parachain id: Id(2001)    
2021-07-27 16:11:29 Parachain Account: 5Ec4AhPV91i9yNuiWuNunPf6AQCYDhFTTA4G5QCbtqYApH9E    
2021-07-27 16:11:29 Parachain genesis state: 0x00000000000000000000000000000000000000000000000000000000000000000074e3d3d5b2839a25f0807e09a40c7c8a8f02246b89961851763effef1f89bbee03170a2e7597b7b7e3d84c05391d139a62b157e78786d8c082f29dcf4c11131400    
2021-07-27 16:11:29 Is collating: yes    
2021-07-27 16:11:29 [Relaychain] 🏷 Local node identity is: 12D3KooWCy1SQgEZq69a8g13kqZrW1aCLx2kW4YTuKozJZ3xhww6    
2021-07-27 16:11:29 [Relaychain] 📦 Highest known block at #19014    
2021-07-27 16:11:29 [Relaychain] 〽️ Prometheus server started at 127.0.0.1:9616    
2021-07-27 16:11:29 [Relaychain] Listening for new connections on 127.0.0.1:9945.    
2021-07-27 16:11:29 [Parachain] Using default protocol ID "sup" because none is configured in the chain specs    
2021-07-27 16:11:29 [Parachain] 🏷 Local node identity is: 12D3KooWHuTZLp9bqWK389HfJDVhmYLUMjX8Pac3tAeTqrsvDyX7    
2021-07-27 16:11:29 [Parachain] 📦 Highest known block at #0    
2021-07-27 16:11:29 [Parachain] Listening for new connections on 127.0.0.1:8845.    
2021-07-27 16:11:29 [Parachain] 〽️ Prometheus server started at 127.0.0.1:9615    
2021-07-27 16:11:30 [Relaychain] 🔍 Discovered new external address for our node: /ip4/10.156.1.116/tcp/30334/ws/p2p/12D3KooWCy1SQgEZq69a8g13kqZrW1aCLx2kW4YTuKozJZ3xhww6    
2021-07-27 16:11:30 [Relaychain] 🔍 Discovered new external address for our node: /ip4/10.156.1.11/tcp/30334/ws/p2p/12D3KooWCy1SQgEZq69a8g13kqZrW1aCLx2kW4YTuKozJZ3xhww6    
2021-07-27 16:11:30 [Relaychain] 🔍 Discovered new external address for our node: /ip4/165.227.236.82/tcp/30334/ws/p2p/12D3KooWCy1SQgEZq69a8g13kqZrW1aCLx2kW4YTuKozJZ3xhww6    
2021-07-27 16:11:34 [Relaychain] ⚙️  Syncing, target=#69976 (11 peers), best: #19047 (0x3c1d…eeac), finalized #19001 (0x69d0…598d), ⬇ 742.6kiB/s ⬆ 10.5kiB/s    
2021-07-27 16:11:34 [Parachain] 💤 Idle (0 peers), best: #0 (0x146a…8758), finalized #0 (0x146a…8758), ⬇ 0 ⬆ 0    
2021-07-27 16:11:45 [Relaychain] Subsystem candidate-validation-subsystem appears unresponsive. signal=BlockFinalized(0x59b51f8f9d450050bb4c373cb097798d6faa9dd91e81a957bf16b1b816d3d752, 19065) received=65
2021-07-27 16:11:45 [Relaychain] Essential task `overseer` failed. Shutting down service.    
2021-07-27 16:11:45 [Relaychain] err=Subsystem(Context("Signal channel is terminated and empty."))
2021-07-27 16:11:45 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Context("Signal channel is terminated and empty.") }
2021-07-27 16:11:45 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Context("Signal channel is terminated and empty.") }
Error: Service(Other("Essential task failed."))

@ordian
Member

ordian commented Jul 27, 2021

1 Intel vCPU

That's not enough. See the validator specs: https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#requirements. Something similar is needed for a collator as well, depending on your parachain.

People mention restarting the node/collator. Is there a special command for restarting, or do they just mean to run it again?

You're not supposed to run the binary directly, but under some kind of daemon process manager that will restart it for you when it exits unexpectedly, e.g. https://pm2.keymetrics.io/ or a systemd service (see https://wiki.debian.org/systemd/Services).
Here is an example unit for a validator service: https://github.com/paritytech/polkadot/blob/master/scripts/packaging/polkadot.service; a minimal sketch follows.
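
For illustration, a minimal sketch of such a unit (binary path, flags, and user are placeholders; adapt them from the linked polkadot.service):

[Unit]
Description=Polkadot Validator
After=network-online.target

[Service]
ExecStart=/usr/bin/polkadot --validator --chain kusama --name my-validator
User=polkadot
# Restart the node automatically when the overseer aborts the process.
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target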

The issue in the general case hasn't been resolved yet, but in your case it's likely caused by an underpowered CPU.

@czarly

czarly commented Aug 7, 2021

I had a stalled node that exited after a while, running as a collator on the Moonriver network. My machine meets the specs: i7-9700K @ 4900 MHz.

An immediate restart allowed the node to operate, but it quickly failed again with the same error. After restarting again, it has now run without failure for 15 hours. I'm running the collator in a Docker container.

Other specs:
64 GB RAM
Samsung SSD 980 1TB
1 GBit/s connectivity

The node failed in two subsequent runs, even after it was removed from the active collator set. The first failure coincided with I/O-wait warnings of up to 70%; the failure after restart didn't show any signs of higher-than-usual I/O activity.

Memory consumption under normal operation is about 3 GB, with swap disabled.

@rphmeier
Contributor Author

This seems to be stale and no longer an issue.

@rphmeier rphmeier closed this as not planned on Oct 31, 2022