Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

polkadot-collator nodes stop blocks importing by accident with "Collation wasn't advertised to any validator" #3139

Closed
mn13 opened this issue May 30, 2021 · 9 comments
Labels
I3-bug Fails to follow expected behavior.

Comments

@mn13
Copy link

mn13 commented May 30, 2021

Initial conditions:

  1. Rococo-local relay chain with two polkadot nodes
    Build: cargo build --release
    Build spec command: ./polkadot build-spec --chain rococo-local --disable-default-bootnode --raw > rococo-local-cfde.json
    Run:
    ./polkadot --chain rococo-local-cfde.json --alice -d ~/cumulus-playground/relay-alice
    ./polkadot --chain rococo-local-cfde.json --bob -d ~/cumulus-playground/relay-bob --port 30334

  2. Parachain with two polkadot-collator nodes as collators. The link points to fork with "rococo-local" in /polkadot-parachains/src/chain_spec.rs
    Build: cargo build --release
    Export genesis and state:
    ./polkadot-collator export-genesis-wasm > genesis-wasm
    ./polkadot-collator export-genesis-state --parachain-id 200 > genesis-state
    Run:
    ./polkadot-collator --collator --alice --force-authoring -d ~/cumulus-playground/polkadot-collator-alice --parachain-id 200 --port 40335 --ws-port 9946 -- --execution wasm --chain ./rococo-local-cfde.json --port 30335
    ./polkadot-collator --collator --bob --force-authoring -d ~/cumulus-playground/polkadot-collator-bob --parachain-id 200 --port 40336 --ws-port 9947 -- --execution wasm --chain ./rococo-local-cfde.json --port 30335

  3. Register parachain from polkadot.js.org/apps connected to relay's alice node: Developer -> Sudo -> parasSudoWrapper -> parasInitialize

The bug:
Accidently parachain's blocks just stopped importing, it can be on 80 or on 200 or some another block number. Logs from parachain's Alice node, where last successfull imported block was 548:

2021-05-28 16:33:00  [Relaychain] ✨ Imported #1559 (0xee93…80c3)    
2021-05-28 16:33:00  [Parachain] Starting collation. relay_parent=0xee93c82e8cfdd5a86b037e5e64938a49da4a9863a8d40662aa08f6c064c180c3 at=0xf29568f3e4adecb8cfd06d1ad1bbe02d6d6920694e22fc8c5680ecfa62b5f287
2021-05-28 16:33:01  [Relaychain] 💤 Idle (4 peers), best: #1559 (0xee93…80c3), finalized #1556 (0x0d48…757a), ⬇ 1.1kiB/s ⬆ 0.5kiB/s    
2021-05-28 16:33:02  [Parachain] 💤 Idle (1 peers), best: #547 (0xf295…f287), finalized #546 (0xe375…a6e8), ⬇ 24 B/s ⬆ 24 B/s    
2021-05-28 16:33:06  [Relaychain] ✨ Imported #1560 (0x0060…507f)    
2021-05-28 16:33:06  [Relaychain] ✨ Imported #1560 (0xae8f…8a8f)    
2021-05-28 16:33:06  [Parachain] Starting collation. relay_parent=0x0060715fef94ff79c865b8deb2c1fc673e07c3ae0a26f46a2b43a48c361e507f at=0xf29568f3e4adecb8cfd06d1ad1bbe02d6d6920694e22fc8c5680ecfa62b5f287
2021-05-28 16:33:06  [Parachain] Starting collation. relay_parent=0xae8fc584a3c87be4cdf52d1883700511642e472cc2e4a5d3a7950bd6e2b68a8f at=0xf29568f3e4adecb8cfd06d1ad1bbe02d6d6920694e22fc8c5680ecfa62b5f287
2021-05-28 16:33:06  [Relaychain] 💤 Idle (4 peers), best: #1560 (0x0060…507f), finalized #1557 (0xbeee…cdb7), ⬇ 1.2kiB/s ⬆ 0.6kiB/s    
2021-05-28 16:33:07  [Parachain] 💤 Idle (1 peers), best: #547 (0xf295…f287), finalized #546 (0xe375…a6e8), ⬇ 0 ⬆ 0    
2021-05-28 16:33:11  [Relaychain] 💤 Idle (4 peers), best: #1560 (0x0060…507f), finalized #1558 (0xe1d1…6ec3), ⬇ 0.6kiB/s ⬆ 0.5kiB/s    
2021-05-28 16:33:12  [Relaychain] 👶 New epoch 156 launching at block 0xf6fc…88a2 (block slot 270368132 >= start slot 270368132).    
2021-05-28 16:33:12  [Relaychain] 👶 Next epoch starts at slot 270368142    
2021-05-28 16:33:12  [Relaychain] ✨ Imported #1561 (0xf6fc…88a2)    
2021-05-28 16:33:12  [Parachain] Starting collation. relay_parent=0xf6fca6f8accae0bec83efd07ef6325d7887af4b1c835d71c1f6d549bc1d388a2 at=0xf29568f3e4adecb8cfd06d1ad1bbe02d6d6920694e22fc8c5680ecfa62b5f287
2021-05-28 16:33:12  [Parachain] 🙌 Starting consensus session on top of parent 0xf29568f3e4adecb8cfd06d1ad1bbe02d6d6920694e22fc8c5680ecfa62b5f287    
2021-05-28 16:33:12  [Parachain] 🎁 Prepared block for proposing at 548 [hash: 0x31f29d54f8ce3f4199b902c0c3c276960ec20d2d9693611b9b5bd77a853a09d9; parent_hash: 0xf295…f287; extrinsics (2): [0xaf04…832a, 0x15c2…24bd]]    
2021-05-28 16:33:12  [Parachain] 🔖 Pre-sealed block for proposal at 548. Hash now 0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55, previously 0x31f29d54f8ce3f4199b902c0c3c276960ec20d2d9693611b9b5bd77a853a09d9.    
2021-05-28 16:33:12  [Parachain] Produced proof-of-validity candidate. block_hash=0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55
2021-05-28 16:33:12  [Parachain] ✨ Imported #548 (0x24ad…cb55)    
2021-05-28 16:33:12  [Parachain] 💤 Idle (1 peers), best: #547 (0xf295…f287), finalized #546 (0xe375…a6e8), ⬇ 62 B/s ⬆ 62 B/s    
2021-05-28 16:33:16  [Relaychain] 💤 Idle (4 peers), best: #1561 (0xf6fc…88a2), finalized #1559 (0xee93…80c3), ⬇ 1.8kiB/s ⬆ 81.7kiB/s    
2021-05-28 16:33:17  [Parachain] 💤 Idle (1 peers), best: #547 (0xf295…f287), finalized #547 (0xf295…f287), ⬇ 64 B/s ⬆ 0.5kiB/s    
2021-05-28 16:33:18  [Relaychain] ✨ Imported #1562 (0xdf86…2afd)    
2021-05-28 16:33:18  [Relaychain] Moving approval window from session 150..=155 to 151..=156
2021-05-28 16:33:21  [Relaychain] 💤 Idle (4 peers), best: #1562 (0xdf86…2afd), finalized #1559 (0xee93…80c3), ⬇ 0.5kiB/s ⬆ 0.3kiB/s    
2021-05-28 16:33:22  [Parachain] 💤 Idle (1 peers), best: #547 (0xf295…f287), finalized #547 (0xf295…f287), ⬇ 0 ⬆ 0    
2021-05-28 16:33:24  [Relaychain] ✨ Imported #1563 (0xf6d6…db1f)    
2021-05-28 16:33:24  [Parachain] Starting collation. relay_parent=0xf6d65c4871dc4e90fe535de05143eed473949373cf946d5f407be1e2f218db1f at=0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55
2021-05-28 16:33:26  [Relaychain] 💤 Idle (4 peers), best: #1563 (0xf6d6…db1f), finalized #1560 (0x0060…507f), ⬇ 1.6kiB/s ⬆ 1.0kiB/s    
2021-05-28 16:33:27  [Parachain] 💤 Idle (1 peers), best: #548 (0x24ad…cb55), finalized #547 (0xf295…f287), ⬇ 0.8kiB/s ⬆ 0.6kiB/s    
2021-05-28 16:33:30  [Relaychain] ✨ Imported #1564 (0xf954…a317)    
2021-05-28 16:33:30  [Parachain] Starting collation. relay_parent=0xf95461048bbdb96d2c0a901eb9514b09f0d2a361487e5051ef04dff4014ba317 at=0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55
2021-05-28 16:33:31  [Relaychain] 💤 Idle (4 peers), best: #1564 (0xf954…a317), finalized #1560 (0x0060…507f), ⬇ 1.0kiB/s ⬆ 0.2kiB/s    
2021-05-28 16:33:32  [Parachain] 💤 Idle (1 peers), best: #548 (0x24ad…cb55), finalized #547 (0xf295…f287), ⬇ 63 B/s ⬆ 0.1kiB/s    
2021-05-28 16:33:32  [Relaychain] 👴 Applying authority set change scheduled at block #1561    
2021-05-28 16:33:32  [Relaychain] 👴 Applying GRANDPA set change to new set [(Public(88dc3417d5058ec4b4503e0c12ea1a0a89be200fe98922423d4334014fa6b0ee (5FA9nQDV...)), 1), (Public(d17c2d7823ebf260fd138f2d7e27d114c0145d968b5ff5006125f2414fadae69 (5GoNkf6W...)), 1)]    
2021-05-28 16:33:32  [Relaychain] 👴 Imported justification for block #1561 that triggers command Changing authorities, signaling voter.    
2021-05-28 16:33:32  [Relaychain] 💔 Invalid justification provided by 12D3KooWMTJGLuW8WSoUDqNsk9CSdtAssqtfQYhya3g3dizs1ebu for #0xf6fc…88a2    
2021-05-28 16:33:36  [Relaychain] ✨ Imported #1565 (0xb813…e5f7)    
2021-05-28 16:33:36  [Parachain] Starting collation. relay_parent=0xb8130cbcadcb50fed2a624d5df4eb1b938e92185a34bccf731da41e34521e5f7 at=0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55
2021-05-28 16:33:36  [Parachain] 🙌 Starting consensus session on top of parent 0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55    
2021-05-28 16:33:36  [Parachain] 🎁 Prepared block for proposing at 549 [hash: 0xb268a0274d4d6413b2fc46031581ed83f49a83a67f56365b6432e4466cab814c; parent_hash: 0x24ad…cb55; extrinsics (2): [0x23b3…6126, 0x708b…5177]]    
2021-05-28 16:33:36  [Parachain] 🔖 Pre-sealed block for proposal at 549. Hash now 0xe32727a34ddc4ed82466c9a82c960aa03fe7abe8bceb9cdc8bc0a692e0192ac6, previously 0xb268a0274d4d6413b2fc46031581ed83f49a83a67f56365b6432e4466cab814c.    
2021-05-28 16:33:36  [Parachain] ✨ Imported #549 (0xe327…2ac6)    
2021-05-28 16:33:36  [Parachain] Produced proof-of-validity candidate. block_hash=0xe32727a34ddc4ed82466c9a82c960aa03fe7abe8bceb9cdc8bc0a692e0192ac6
2021-05-28 16:33:36  [Relaychain] 💤 Idle (3 peers), best: #1565 (0xb813…e5f7), finalized #1562 (0xdf86…2afd), ⬇ 1.0kiB/s ⬆ 0.4kiB/s    
2021-05-28 16:33:37  [Parachain] 💤 Idle (1 peers), best: #548 (0x24ad…cb55), finalized #547 (0xf295…f287), ⬇ 32 B/s ⬆ 93 B/s    
2021-05-28 16:33:41  [Relaychain] 💤 Idle (3 peers), best: #1565 (0xb813…e5f7), finalized #1563 (0xf6d6…db1f), ⬇ 0.3kiB/s ⬆ 0.2kiB/s    
2021-05-28 16:33:42  [Relaychain] ✨ Imported #1566 (0x0ded…27cb)    
2021-05-28 16:33:42  [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0xdce6c1d1ff94687620f5ca7a7bd57ba310bf3ab165e5212764518566c7b655b8 pov_hash=0xc8a19d87dc0542914fabd2da3b75115d5f088e32150fdb5e0bbe542b4b3ad013
2021-05-28 16:33:42  [Parachain] Starting collation. relay_parent=0x0deda9af50cd24b9709af40a079627e6782575c0017921299ff40ce9d35527cb at=0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55
2021-05-28 16:33:42  [Parachain] 🙌 Starting consensus session on top of parent 0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55    
2021-05-28 16:33:42  [Parachain] 💤 Idle (1 peers), best: #548 (0x24ad…cb55), finalized #548 (0x24ad…cb55), ⬇ 62 B/s ⬆ 69 B/s    
2021-05-28 16:33:42  [Parachain] 🎁 Prepared block for proposing at 549 [hash: 0x4b3fad5af9e75286bab059f8823d1df3d207d210fc99b518d8baba4e2f10abd3; parent_hash: 0x24ad…cb55; extrinsics (2): [0xdead…3ad6, 0x23de…fa7b]]    
2021-05-28 16:33:42  [Parachain] 🔖 Pre-sealed block for proposal at 549. Hash now 0x265ee3768c1272cb84d019c0e30923a93c5a4ed48be7286b60f95f675488ac37, previously 0x4b3fad5af9e75286bab059f8823d1df3d207d210fc99b518d8baba4e2f10abd3.    
2021-05-28 16:33:42  [Parachain] ✨ Imported #549 (0x265e…ac37)    
2021-05-28 16:33:42  [Parachain] Produced proof-of-validity candidate. block_hash=0x265ee3768c1272cb84d019c0e30923a93c5a4ed48be7286b60f95f675488ac37
2021-05-28 16:33:46  [Relaychain] 💤 Idle (3 peers), best: #1566 (0x0ded…27cb), finalized #1563 (0xf6d6…db1f), ⬇ 0.4kiB/s ⬆ 0.2kiB/s    
2021-05-28 16:33:47  [Parachain] 💤 Idle (1 peers), best: #548 (0x24ad…cb55), finalized #548 (0x24ad…cb55), ⬇ 24 B/s ⬆ 30 B/s    
2021-05-28 16:33:48  [Relaychain] ✨ Imported #1567 (0xb8a4…0d80)    
2021-05-28 16:33:48  [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x5e91d69009ba15babd6623de8ef904d1414f420617decd0f0c327c99de19daef pov_hash=0x3be71c220b2a2751e58428b4e5fa9173d562b1e60649e9238c6713d3d8571510
2021-05-28 16:33:48  [Parachain] Starting collation. relay_parent=0xb8a45e74bc3bb7676bbd88b3f24bb761239711e4cfe633434e10d3f76f880d80 at=0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55
2021-05-28 16:33:51  [Relaychain] 💤 Idle (3 peers), best: #1567 (0xb8a4…0d80), finalized #1564 (0xf954…a317), ⬇ 0.8kiB/s ⬆ 0.3kiB/s    
2021-05-28 16:33:52  [Parachain] 💤 Idle (1 peers), best: #548 (0x24ad…cb55), finalized #548 (0x24ad…cb55), ⬇ 0 ⬆ 0    
2021-05-28 16:33:54  [Relaychain] ✨ Imported #1568 (0xb39c…077d)    
2021-05-28 16:33:54  [Parachain] Starting collation. relay_parent=0xb39c782efbf46463479130975e3e2efef371fe9b1d5ea2dc90119b6c3ca6077d at=0x24add5dd989454987f3f42e769c4c939965af19ee95597b12cd4a90c80bfcb55
2021-05-28 16:33:56  [Relaychain] 💤 Idle (3 peers), best: #1568 (0xb39c…077d), finalized #1565 (0xb813…e5f7), ⬇ 0.7kiB/s ⬆ 0.3kiB/s    
2021-05-28 16:33:57  [Parachain] 💤 Idle (1 peers), best: #548 (0x24ad…cb55), finalized #548 (0x24ad…cb55), ⬇ 81 B/s ⬆ 0.1kiB/s
@bkchr bkchr transferred this issue from paritytech/cumulus May 31, 2021
@weichweich
Copy link
Contributor

We have the same issue. We updated all our dependencies to track 0.9.3 and are using the docker image parity/polkadot:v0.9.3 to run the relay chain. The logs are attached
issue-0.9.3.zip

@ordian
Copy link
Member

ordian commented Jun 2, 2021

I reproduced the issue locally with alice, bob at 6077ed0 and one polkadot-collator (commit). Looks like the issue is Invalid justification provided which leads to a fatal reputation change: https://github.com/paritytech/substrate/blob/4652f9e00f0e3079b9ed40ff806829f17fd1ddcf/client/network/src/protocol.rs#L1014. They might recover later since reputation goes to zero eventually.

After the collator receives an invalid justification from Alice, it's only connected to Bob on the collation protocol, but when it receives an invalid justification from Bob (at some random point in the future), it has 0 collation peers and hence Collation wasn't advertised to any validator.

INFO tokio-runtime-worker sc_network::protocol: [Relaychain] 💔 Invalid justification provided by Bob for #Block    
DEBUG tokio-runtime-worker parachain::network-bridge: [Relaychain] action="PeerDisconnected" peer_set=Collation peer=PeerId("Bob")

@mn13
Copy link
Author

mn13 commented Jun 2, 2021

@ordian Thank you for answer!

They might recover later since reputation goes to zero eventually.

Do you mean that it's a bug that's needed to be fixed or collators just need to wait for recover collation peers?

@ordian ordian added the I3-bug Fails to follow expected behavior. label Jun 2, 2021
@andresilva
Copy link
Contributor

Could you try running the validators with --no-beefy? This is currently only enabled on Rococo since it's still in experimental phase.

@ordian
Copy link
Member

ordian commented Jun 3, 2021

Indeed, this is a BEEFY justification and running the validators with --no-beefy seems to resolve the issue.

There is another issue, namely, if the validator's reputation goes below the banning threshold, the collator should try to reconnect at least a half a minute later, but that doesn't happen. Should probably be fixed with paritytech/substrate#9025.

@ordian
Copy link
Member

ordian commented Jun 13, 2021

Closing as paritytech/substrate#9075 and paritytech/substrate#9025 are merged.

@ordian ordian closed this as completed Jun 13, 2021
@zqhxuyuan
Copy link
Contributor

zqhxuyuan commented Jun 25, 2021

@ordian @bkchr I'm running westend-local with alice,bob, and westmint-local with alice,bob two both v0.9.5 version in one node(as local testnet, I change the epoch time to 3 min).

The parachain produce some block and stuck at block#16. after some dig log, I found the block stuck because of "PeerDisconnected" instead of "PeerMessages", causing "Collation wasn't advertised to any validator". Any idea why PeerDisconnected happend? as my understanding, this only one node started two relay chain and two collator, and the relay chain keep produce block normally, the network should be working fine.

block#15:
3139-1

block#16:
3139-2

PS: I've already open debug: -l parachain=trace, but can't find "Invalid justification..." just like above comment, so may be there're other issue causing collator found peer validator disconnect?

PPS: I also tried --no-beefy or latest v0.9.6, but with no luck, still has the same problem which stuck at some random block.
advertised

my command:

PCHAIN="westmint-local"
PID="1000"
VERSION="096"
CHAIN="westend-local"

./target/release/polkadot build-spec --chain $CHAIN --disable-default-bootnode --raw > ~/$CHAIN-$VERSION.json
nohup ./target/release/polkadot -l parachain=trace --no-beefy --validator --chain ~/$CHAIN-$VERSION.json -d /tmp/p_11 --ws-port 9990 --port 9991 --rpc-port=9995 --unsafe-rpc-external --rpc-methods=unsafe --rpc-cors=all --unsafe-ws-external --alice > 096_1.log &
nohup ./target/release/polkadot -l parachain=trace --no-beefy --validator --chain ~/$CHAIN-$VERSION.json -d /tmp/p_21 --ws-port 9992 --port 9993 --rpc-port=9996 --unsafe-rpc-external --rpc-methods=unsafe --rpc-cors=all --unsafe-ws-external --bob \
--bootnodes /ip4/127.0.0.1/tcp/9991/p2p/....... > 096_2.log &

./target/release/polkadot-collator export-genesis-state --chain $PCHAIN --parachain-id $PID > ~/$PCHAIN-$VERSION-genesis-state
./target/release/polkadot-collator export-genesis-wasm --chain $PCHAIN > ~/$PCHAIN-$VERSION-genesis-wasm
nohup ./target/release/polkadot-collator -l parachain=trace --chain $PCHAIN --parachain-id=$PID --collator -d /tmp/p_31 --port 10100 --rpc-port=10101 --ws-port 10102 --unsafe-rpc-external --rpc-methods=unsafe --rpc-cors=all --unsafe-ws-external --alice \
-- --no-beefy --execution wasm --chain ~/$CHAIN-$VERSION.json -d /tmp/p_41 --port 10103 --rpc-port=10104 --ws-port 10105 --unsafe-rpc-external --rpc-methods=unsafe --rpc-cors=all --unsafe-ws-external > 096_3.log &
nohup ./target/release/polkadot-collator -l parachain=trace --chain $PCHAIN --parachain-id=$PID --collator -d /tmp/p_51 --port 20100 --rpc-port=20101 --ws-port 20102 --unsafe-rpc-external --rpc-methods=unsafe --rpc-cors=all --unsafe-ws-external --bob \
-- --no-beefy --execution wasm --chain ~/$CHAIN-$VERSION.json -d /tmp/p_61 --port 20103 --rpc-port=20104 --ws-port 20105 --unsafe-rpc-external --rpc-methods=unsafe --rpc-cors=all --unsafe-ws-external > 096_4.log &

@transxask
Copy link

We found the same problem in 0.9.13.

@ordian
Copy link
Member

ordian commented Jan 8, 2022

@transxask we've merged a few fixes recently #4640, #4642. Could you try master branch? The fixes are going into the next release (0.9.15).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I3-bug Fails to follow expected behavior.
Projects
None yet
Development

No branches or pull requests

6 participants