Skip to content
This repository has been archived by the owner on Dec 4, 2024. It is now read-only.

Stuck chain due to block locked is incorrect #232

Closed
dankostiuk opened this issue Nov 20, 2021 · 5 comments
Closed

Stuck chain due to block locked is incorrect #232

dankostiuk opened this issue Nov 20, 2021 · 5 comments
Assignees
Labels
info needed More information needed

Comments

@dankostiuk
Copy link
Contributor

dankostiuk commented Nov 20, 2021

Stuck chain due to block locked is incorrect

Description

Update - See the logs below, it seems that our chain got stuck as a result of block locked is incorrect being thrown on the majority of the validator set.

We've encountered an issue where our testnet stopped writing new blocks as a result of all nodes getting stuck within RoundChangeState. As a result, a new proposer isn't able to elected to resume writing blocks.

Note that our testnet consisted of 15 validators.

As all nodes calculate the same next proposer (based on the last written block), i've tried to follow the steps taken for both these types of nodes which I'll refer to as ProposerNode and NormalNode -- both end up forever remaining in RoundChangeState:

ProposerNode:

  • upon startup, goes in ValidateState since we determine that we are the proposer, we then get the next message from msqQueue which always happens to be of type MessageReq_Prepare
  • the condition in ibft.go#runValidateState if i.state.numPrepared() > i.state.NumValid() { is never met since NumValid() is 8 ((15/3 - 1) * 2) and the numPrepared() is 1
  • we then fetch the next message from msqQueue which is nil so falls into RoundChangeState

NormalNode:

  • upon startup, since it is determined that we are not the proposer, we go into RoundChangeState and keep adding messages each round and comparing the length of round messages to state.NumValid() which again is never met since NumValid() is 8 and we've only ever seen the length of round messages to be 1

Since blocks aren't being added, the validator set will always be seen as 15 - I believe the NumValid() is calculated from the snapshot which isn't updated as a result of the blocks not being written. Attempting to vote drop to force-pick a different proposer doesn't do anything since again, it seems blocks need to be writing to do so.

Your environment

  • OS and version OSX 11.5.2
  • version of the Polygon SDK ce793fa
  • branch that causes this issue develop

Expected behaviour

We likely froze the chain as a result of testing the txpool by sending high levels of txns, but I expect that if the chain gets into this state, it should be able to recover.

Actual behaviour

Even when removing the number of peers connected so that we only have 4 in total, the ValidatorSet is still 15 and we are still expecting the calculated proposer to begin writing blocks.

Proposed solution

Perhaps the validatorSet length can be derived another way without using the validator list from the latest snapshot. Not sure if this is possible though.

Logs

2021-11-18T23:38:09.874Z [DEBUG] polygon.consensus.ibft: cycle: state=AcceptState sequence=173592 round=4
2021-11-18T23:38:09.874Z [INFO] polygon.consensus.ibft.acceptState: Accept state: sequence=173592
2021-11-18T23:38:09.874Z [INFO] polygon.consensus.ibft: current snapshot: validators=15 votes=0
2021-11-18T23:38:09.874Z [INFO] polygon.consensus.ibft: proposer calculated: proposer=0x097F8AB58FaAa6f282C3a50cAb0c1F43b26AA9aF block=173592
2021-11-18T23:38:09.884Z [DEBUG] polygon.consensus.ibft: state change: new=RoundChangeState
2021-11-18T23:38:09.884Z [DEBUG] polygon.consensus.ibft: cycle: state=RoundChangeState sequence=173592 round=4
2021-11-18T23:38:09.885Z [DEBUG] polygon.consensus.ibft: round change handle err: err="block locked is incorrect"
2021-11-18T23:38:09.885Z [DEBUG] polygon.consensus.ibft: local round change: round=5
2021-11-18T23:38:09.872Z [DEBUG] polygon.consensus.ibft: state change: new=AcceptState
@dankostiuk dankostiuk changed the title Stuck chain not able to elect a new proposer to resume writing blocks Stuck chain due to block locked is incorrect Nov 22, 2021
@zivkovicmilos zivkovicmilos self-assigned this Nov 23, 2021
@zivkovicmilos
Copy link
Contributor

Hey @dankostiuk,

Thank you for reaching out.

Have you pulled the latest changes from the develop branch? I see you're running on an old commit.

The error that's being printed is due to the block locking mechanism of IBFT (ethereum/EIPs#650), I am however not yet sure why you're running into this error.

@zivkovicmilos zivkovicmilos added the info needed More information needed label Nov 23, 2021
@dankostiuk
Copy link
Contributor Author

Hey @zivkovicmilos ,

We've updated to today's latest on develop and attempted to reproduce this issue by terminating instances within our validator set (N=15) until we saw blocks no longer being written.

For any validators still online, we observed that a few became stuck in AcceptState, possibly due to an error caused by discovery.go#run() that occurred around the same time:

[ERROR] polygon.network: failed to dial bootnode: err="rpc error: code = Unavailable desc = timed out waiting for server handshake"

The remaining nodes appeared stuck in RoundChangeState but could never exit as the number of MessageReq_RoundChange for their current round never met the number required (4) to transition to AcceptState.

As the nodes stuck in RoundChangeState weren't on the same round, we figured a full restart of all our nodes would force them to start over on the same round, promptly meeting the required length-per-round number of roundMessages to exit to AcceptState. By doing so, all nodes resumed their normal ibft flows.

After some investigation, 2 questions come to mind:

  1. Could a failed to dial bootnode error really hang nodes in their current IBFT state (acceptState in our case)?
  2. When several nodes in RoundStateChange are on the same sequence and different rounds, it seems that only the node on the lowest round can be the one to exit here (as future round nodes will ignore msgs from older rounds). This part of the system feels sensitive to error. Are there plans to improve this area of the code?

@zivkovicmilos
Copy link
Contributor

Hey @dankostiuk,

Thank you for going the extra mile and selectively shutting down the nodes in an effort to reproduce the problem.

I'll try to answer some of the questions below:

  1. This error only appears when setting up the discovery protocol, and it is not tied to any major logic panics - meaning it has no effect directly on the IBFT state machine.
  2. I'll paste the part from the IBFT spec here, since I can't anchor the correct section:

Screenshot 2021-11-24 at 18 43 43

I've went through the round change state logic while going through issue #236 , and it works as it should - it conforms to the spec.

We haven't planned out a refactor for this section in the foreseeable future, but I definitely agree that currently it's clunky.
Clunky, but it works (confirmed by manual / automatic tests).

I would love to continue the discussion on #235 , since I believe the reason you're staying in the round change state is because nodes keep going down, and there's no way for you to get the correct number of messages to continue.

@mrwillis
Copy link
Contributor

Hey @zivkovicmilos, so spent more time on this. One idea that I have is that about half the nodes were stuck in accept state and the other half in round change. This is what a lot of the logs of our nodes looked like.

Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.127Z [INFO]  polygon: Data dir: path=/home/ubuntu/sx-node/mynode
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.132Z [INFO]  polygon.network: LibP2P server running: addr=/ip4/34.207.177.200/tcp/10001/p2p/16Uiu2HAkxUUzBnXb1iwjktg1K2t1Ai2f3wDEUBW9NsMHZSBvcruV
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.245Z [INFO]  polygon.consensus.ibft: validator key: addr=0x10Bb1c04c6B8906BBDc5Bae8272006e57F7EFB7E
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.259Z [INFO]  polygon.blockchain: Current header: hash=0x9fcee98faac2bfac2282429969b4ba190ed62424cbff6738ee5693639c66a4f1 number=173590
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.259Z [INFO]  polygon.blockchain: genesis: hash=0xf63461c6309c5832a99b2346895f8d78c8256331b85af3a43b4133532530be8d
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.260Z [INFO]  polygon: GRPC server running: addr=0.0.0.0:10000
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.260Z [INFO]  polygon.jsonrpc: http server started: addr=0.0.0.0:10002
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.260Z [INFO]  polygon.consensus.ibft: snapshot was not found, restore snapshot at beginning of current epoch: current epoch=1
Nov 18 23:30:32 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:32.262Z [INFO]  polygon.consensus.ibft: syncing past snapshots: from=100000 to=173590
Nov 18 23:30:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:37.152Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmP2bpgr3sLUUzt61bEzAL7ptdNXnvsfnwcxyrfCMFjxDf
Nov 18 23:30:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:37.159Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmERLox5fG95J9SRrrv8YQ1oEMF5XxU5L5Ew1VyXa3qrHF
Nov 18 23:30:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:37.170Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmFpTiQcctV2KNpdKm9FehHKFLTUC13oaBGVzvZta7iFMc
Nov 18 23:30:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:37.173Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmTvfiVHL6v7V9JmLdkM4jHHTWC34mMHqYDdw4HhLoFTYD
Nov 18 23:30:38 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:38.149Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmSWw35R2y5FJonwUMaoRmKt1WhCzgjCAGD1fsGtLzeBwf
Nov 18 23:30:38 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:38.467Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmC2AvVqyvjMr8BMPth9Lnf1fkFAc5bMGH2Y3E5NXRk3gR
Nov 18 23:30:39 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:39.544Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmVNa5jRxiLkDtVRirUTHLZ2CYCbyR1VeFH5eYVdyY4wUF
Nov 18 23:30:40 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:40.496Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmSYhVdFw1ShS372r3yDi5ecVDesNW14VwNce6KavMoDRM
Nov 18 23:30:40 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:40.559Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmGufbQZfBgkk6jvF7axaXKE7kbXq5CSfEokchmv2D3uiK
Nov 18 23:30:40 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:40.868Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmHPKMSKgbaLkRTgz9D6SstMoaiEwWUtCNn3UYD8HR6JJj
Nov 18 23:30:41 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:41.011Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAkztynubfLkDDMggZxJNbZTU65M7x2vFNwyYzQm9Dz5Nuv
Nov 18 23:30:41 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:41.060Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmD2i4cysbCm3Kd6AdSjydMfV7idbB1fhjcxUMJkcDmRoP
Nov 18 23:30:41 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:41.412Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAm6MdtxtjRUG7PnG7h1RjTRwnAniRmBnBy6u5athYpfSSU
Nov 18 23:30:42 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:42.152Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmBuezYDrZqvRxmHc7shTADWtuqSETNfPrT4Mg1NNFeseU
Nov 18 23:30:42 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:42.156Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAm2Kdu9ecpvXViUv3n21AMdNtvWMe56RmaS4XUnKrQAZ1L
Nov 18 23:30:42 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:42.288Z [INFO]  polygon.network: Peer connected: id=16Uiu2HAmDyESKCkJ4kMF7tJYKNaQkz5xWAVnP9L1o3JkA3soJqyB
Nov 18 23:30:57 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:57.592Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173591
Nov 18 23:30:57 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:57.593Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:30:57 ip-172-31-39-197 main[63679]: 2021-11-18T23:30:57.593Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x1AC502B21b8b73d881dA7F5af6abDe15FC22EB33 block=173591
Nov 18 23:36:36 ip-172-31-39-197 main[63679]: 2021-11-18T23:36:36.595Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173591
Nov 18 23:36:36 ip-172-31-39-197 main[63679]: 2021-11-18T23:36:36.596Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:36:36 ip-172-31-39-197 main[63679]: 2021-11-18T23:36:36.596Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x9a5566c24a96Dc08e0A2404F23e93350a85f686e block=173591
Nov 18 23:36:36 ip-172-31-39-197 main[63679]: 2021-11-18T23:36:36.625Z [INFO]  polygon.blockchain: write block: num=173591 parent=0x9fcee98faac2bfac2282429969b4ba190ed62424cbff6738ee5693639c66a4f1
Nov 18 23:36:36 ip-172-31-39-197 main[63679]: 2021-11-18T23:36:36.630Z [INFO]  polygon.blockchain: new block: number=173591 hash=0xa8eaf9be701a2d9b8df4f6d60f25c3f77bf650d2c6f184aac62b8e48a8090fe6 txns=0 generation_time_in_sec=2178
Nov 18 23:36:36 ip-172-31-39-197 main[63679]: 2021-11-18T23:36:36.630Z [INFO]  polygon.blockchain: new head: hash=0xa8eaf9be701a2d9b8df4f6d60f25c3f77bf650d2c6f184aac62b8e48a8090fe6 number=173591
Nov 18 23:36:36 ip-172-31-39-197 main[63679]: 2021-11-18T23:36:36.630Z [INFO]  polygon.consensus.ibft: block committed: sequence=173591 hash=0xa8eaf9be701a2d9b8df4f6d60f25c3f77bf650d2c6f184aac62b8e48a8090fe6 validators=15 rounds=12 committed=9
Nov 18 23:37:15 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:15.529Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:37:15 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:15.529Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:15 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:15.530Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x64B1845355e8D85D62d7ea01CAec8Df7ecB09e84 block=173592
Nov 18 23:37:25 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:25.721Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:37:25 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:25.723Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:25 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:25.724Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x27F6Ae1f03986A374Cca5A4CF4f9d43324B9F09E block=173592
Nov 18 23:37:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:37.736Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:37:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:37.736Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:37 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:37.737Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0xce6e21a1f84C68C9C2F45EA43B93F5525dfE2a95 block=173592
Nov 18 23:37:51 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:51.860Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:37:51 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:51.860Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:37:51 ip-172-31-39-197 main[63679]: 2021-11-18T23:37:51.863Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x1AC502B21b8b73d881dA7F5af6abDe15FC22EB33 block=173592
Nov 18 23:38:08 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:08.201Z [INFO]  polygon.network: Peer disconnected: id=16Uiu2HAmFpTiQcctV2KNpdKm9FehHKFLTUC13oaBGVzvZta7iFMc
Nov 18 23:38:09 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:09.873Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:38:09 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:09.873Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:38:09 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:09.873Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x097F8AB58FaAa6f282C3a50cAb0c1F43b26AA9aF block=173592
Nov 18 23:38:35 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:35.895Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:38:35 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:35.895Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:38:35 ip-172-31-39-197 main[63679]: 2021-11-18T23:38:35.896Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0xB73a1fBE13bAc3b92d6fc9c10a6068C8Edb3b19c block=173592
Nov 18 23:39:17 ip-172-31-39-197 main[63679]: 2021-11-18T23:39:17.948Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:39:17 ip-172-31-39-197 main[63679]: 2021-11-18T23:39:17.948Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:39:17 ip-172-31-39-197 main[63679]: 2021-11-18T23:39:17.948Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x369a595Acea231aEA9809ddab58117204838EA71 block=173592
Nov 18 23:40:31 ip-172-31-39-197 main[63679]: 2021-11-18T23:40:31.965Z [INFO]  polygon.consensus.ibft.acceptState: Accept state: sequence=173592
Nov 18 23:40:31 ip-172-31-39-197 main[63679]: 2021-11-18T23:40:31.965Z [INFO]  polygon.consensus.ibft: current snapshot: validators=15 votes=0
Nov 18 23:40:31 ip-172-31-39-197 main[63679]: 2021-11-18T23:40:31.966Z [INFO]  polygon.consensus.ibft: proposer calculated: proposer=0x7C482A9234255433A96c1431ef92A8DaC5ebF488 block=173592

We saw this on about 7 nodes which was about half our fleet at the time. We thought this looked a little odd. Our thought was after timing out in AcceptState, it should get booted to the RoundChangeState, however, it didn't seem to happen here.

So one hypothesis is there is nothing wrong with round change / round change timeouts, but nodes were not able to get out of AcceptState. Any ideas here?

@dankostiuk
Copy link
Contributor Author

Closing as we haven't seen block locked is incorrect and #263 should result in a faster recovery time for stuck chains

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
info needed More information needed
Projects
None yet
Development

No branches or pull requests

3 participants