Stuck chain due to `block locked is incorrect` #232

Comments
Hey @dankostiuk, Thank you for reaching out. Have you pulled the latest changes from the develop branch? I see you're running on an old commit. The error that's being printed comes from the block locking mechanism of IBFT (ethereum/EIPs#650); however, I'm not yet sure why you're running into it.
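To sketch what that locking rule means in practice, here's a minimal, hypothetical Go example (names are made up for illustration, this is not the actual project code): once a validator has locked on a proposal after enough PREPARE messages, any conflicting proposal for the same sequence is rejected, which is the kind of situation the `block locked is incorrect` message points at.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Hypothetical consensus state for illustration only.
type state struct {
	locked     bool
	lockedHash []byte // hash of the proposal we locked on after receiving enough PREPAREs
}

var errBlockLocked = errors.New("block locked is incorrect")

// validateProposal mirrors the EIP-650 locking rule: once locked,
// only the locked proposal may be accepted for this sequence.
func (s *state) validateProposal(proposalHash []byte) error {
	if s.locked && !bytes.Equal(s.lockedHash, proposalHash) {
		return errBlockLocked
	}
	return nil
}

func main() {
	s := &state{locked: true, lockedHash: []byte{0x01}}
	if err := s.validateProposal([]byte{0x02}); err != nil {
		fmt.Println(err) // block locked is incorrect
	}
}
```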
Hey @zivkovicmilos, We've updated to today's latest on develop and attempted to reproduce this issue by terminating instances within our validator set (N=15) until we saw blocks no longer being written. Of the validators still online, we observed that a few became stuck in AcceptState, possibly due to an error caused by
The remaining nodes appeared stuck in RoundChangeState but could never exit, as the number of MessageReq_RoundChange messages for their current round never met the number required (4) to transition to AcceptState. Since the nodes stuck in RoundChangeState weren't on the same round, we figured a full restart of all our nodes would force them to start over on the same round, promptly meeting the required per-round number of roundMessages to exit to AcceptState. By doing so, all nodes resumed their normal IBFT flows. After some investigation, two questions come to mind:
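To illustrate what we mean by the messages never reaching the required count, here's a rough Go sketch of the round-change bookkeeping (the names and the quorum constant are our own, taken from what we observed, not from the project's code): each node counts ROUND_CHANGE messages per round and only moves on once its current round reaches the quorum, which never happens when the stuck nodes are spread across different rounds.

```go
package main

import "fmt"

// roundChangeQuorum is the number of ROUND_CHANGE messages we observed being
// required before a node leaves RoundChangeState (4 in our 15-validator set).
const roundChangeQuorum = 4

// readyToTransition reports whether the node's *current* round has collected
// enough ROUND_CHANGE messages to move back to AcceptState.
func readyToTransition(messagesByRound map[uint64]int, currentRound uint64) bool {
	return messagesByRound[currentRound] >= roundChangeQuorum
}

func main() {
	// Nodes scattered across rounds: no single round ever collects 4 messages.
	messages := map[uint64]int{3: 1, 5: 2, 7: 1}
	for _, round := range []uint64{3, 5, 7} {
		fmt.Printf("round %d ready: %v\n", round, readyToTransition(messages, round))
	}
}
```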
Hey @dankostiuk, Thank you for going the extra mile and selectively shutting down the nodes in an effort to reproduce the problem. I'll try to answer some of the questions below:
I've gone through the round change state logic while working on issue #236, and it works as it should - it conforms to the spec. We haven't planned a refactor of this section for the foreseeable future, but I definitely agree that it's currently clunky. I would love to continue the discussion on #235, since I believe the reason you're staying in the round change state is that nodes keep going down, and there's no way for you to get the correct number of messages to continue.
Hey @zivkovicmilos, so we spent more time on this. One thing we noticed is that about half the nodes were stuck in AcceptState and the other half in RoundChangeState. This is what a lot of our nodes' logs looked like.
We saw this on about 7 nodes, which was about half our fleet at the time, and it looked a little odd to us. Our thinking was that after timing out in AcceptState, a node should get booted to RoundChangeState; however, that didn't seem to happen here. So one hypothesis is that there's nothing wrong with round change or the round-change timeouts, but that nodes were not able to get out of AcceptState (see the sketch below). Any ideas here?
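To make the hypothesis concrete, here's a hypothetical Go sketch of the transition we expected (illustrative names only, not the project's state machine): a node sitting in AcceptState with no proposal arriving should hit its round timeout and fall through to RoundChangeState, and that's the move our logs suggest never happened.

```go
package main

import (
	"fmt"
	"time"
)

type ibftState int

const (
	AcceptState ibftState = iota
	RoundChangeState
)

// runAcceptState waits for a proposal; if none arrives before the round
// timeout fires, the node is expected to move to RoundChangeState.
func runAcceptState(proposalCh <-chan struct{}, timeout time.Duration) ibftState {
	select {
	case <-proposalCh:
		return AcceptState // proposal received, keep processing in AcceptState
	case <-time.After(timeout):
		return RoundChangeState // the transition we never saw in the logs
	}
}

func main() {
	noProposal := make(chan struct{}) // nothing is ever sent: the proposer is down
	next := runAcceptState(noProposal, 100*time.Millisecond)
	fmt.Println(next == RoundChangeState) // true: the expected behaviour
}
```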
Closing as we haven't seen this issue come up again.
Stuck chain due to `block locked is incorrect`
Description
We've encountered an issue where our testnet stopped writing new blocks because all nodes got stuck in RoundChangeState. As a result, a new proposer can't be elected to resume writing blocks.
Note that our testnet consisted of 15 validators.
As all nodes calculate the same next proposer (based on the last written block), I've tried to follow the steps taken by both of these types of nodes, which I'll refer to as ProposerNode and NormalNode -- both end up remaining in RoundChangeState forever:
ProposerNode:
The condition `if i.state.numPrepared() > i.state.NumValid() {` is never met, since `NumValid()` is 8 ((15/3 - 1) * 2) and `numPrepared()` is 1.

NormalNode:
The check against `state.NumValid()` is again never met, since `NumValid()` is 8 and we've only ever seen the length of round messages be 1.

Since blocks aren't being added, the validator set will always be seen as 15 - I believe `NumValid()` is calculated from the snapshot, which isn't updated because blocks aren't being written. Attempting to vote-drop a validator to force-pick a different proposer doesn't do anything either, since again, it seems blocks need to be written for that to take effect.
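To make the arithmetic explicit, here's a small Go sketch of the figures above (my own reconstruction of the numbers, not the project's code): with the snapshot still reporting 15 validators, the quorum works out to 8, which the handful of nodes left online can never supply.

```go
package main

import "fmt"

// quorum reproduces the figure from above: for 15 validators,
// (15/3 - 1) * 2 = 8 messages are needed before the state can advance.
func quorum(validatorSetSize int) int {
	return (validatorSetSize/3 - 1) * 2
}

func main() {
	snapshotValidators := 15 // snapshot never updates because no blocks are written
	liveValidators := 4      // what we were actually left with

	need := quorum(snapshotValidators)
	fmt.Printf("need %d messages, can get at most %d\n", need, liveValidators)
	fmt.Println("stuck:", liveValidators < need) // true
}
```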
Your environment
OS: OSX 11.5.2
Commit: ce793fa
Branch: develop
Expected behaviour
We likely froze the chain as a result of testing the txpool by sending a high volume of transactions, but I expect that if the chain gets into this state, it should be able to recover.
Actual behaviour
Even when reducing the number of connected peers so that we only have 4 in total, the ValidatorSet is still 15 and we are still waiting on the calculated proposer to begin writing blocks.
Proposed solution
Perhaps the validatorSet length can be derived another way without using the validator list from the latest snapshot. Not sure if this is possible though.
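To illustrate the idea only (purely hypothetical, untested, and with made-up names - not a concrete patch), something along these lines, where the effective set size is capped by the peers we can actually reach instead of taken from the stale snapshot; the degenerate result for 4 peers is exactly why I'm not sure this is safe or even possible:

```go
package main

import "fmt"

// effectiveQuorum is an illustrative alternative that caps the set size by the
// number of currently reachable peers, rather than trusting the snapshot alone.
func effectiveQuorum(snapshotSize, reachablePeers int) int {
	size := snapshotSize
	if reachablePeers < size {
		size = reachablePeers
	}
	return (size/3 - 1) * 2
}

func main() {
	fmt.Println(effectiveQuorum(15, 15)) // 8, same as today
	fmt.Println(effectiveQuorum(15, 4))  // 0, which shows why this is tricky
}
```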
Logs