AcceptState not timing out to RoundChangeState

Description

When debugging our latest testnet crash (#235 and #232), we noticed a lot of the nodes got stuck in AcceptState, unable to get pushed out to RoundChangeState. So essentially we had a bunch of nodes in RoundChangeState, waiting for others to get into RoundChangeState, and some nodes in AcceptState unable to get booted to RoundChangeState.
We're not quite sure how or why this happens yet.
Unless I'm missing something, in runAcceptState, it looks like the only way this can happen is if getNextMessage continues to return a false value in the second parameter.

Your environment

Steps to reproduce

Expected behaviour

After timing out, it should get pushed to RoundChangeState according to the spec.

Actual behaviour

Nodes get stuck in AcceptState in an exponentially-increasing timeout loop.

Logs
I have investigated this issue and concluded the following:
1. The method you mentioned, getNextMessage, returns false in its second return value only when the closeCh channel is closed. In that case the node is being shut down and all code execution in the IBFT state machine stops, so this isn't relevant for this situation.
2. The exponential timeout is achieved by adding 2 to the power of the current round number to a base timeout of 10 seconds. This implies that the node isn't actually stuck in AcceptState: rounds are progressing, and the node is constantly cycling between AcceptState, ValidateState and RoundChangeState (see the first sketch below this list).
3. The misleading thing here is that no logs indicate that rounds are progressing, so this is something I will address.
4. The situation you described is a consequence of the network falling out of consensus, meaning that not enough validator nodes are functioning (more than 1/3 of the total validators are down). You can easily reproduce this by starting the 4-node cluster, shutting down 2 nodes, and observing the logs on the running nodes, which will show patterns identical to those in your description (see the quorum sketch below this list).
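To make points 1 and 2 concrete, here is a minimal, self-contained Go sketch of the pattern described above. It is not the repository's actual code: the helper exponentialTimeout and the channel wiring are illustrative assumptions; only the described behaviour (a 10-second base timeout plus 2^round seconds, and getNextMessage returning false in its second value only when closeCh is closed) is taken from the comment.

```go
package main

import (
	"fmt"
	"time"
)

// exponentialTimeout mirrors the formula described in point 2: a 10-second
// base plus 2^round seconds. (Illustrative helper, not the repository's
// implementation.)
func exponentialTimeout(round uint64) time.Duration {
	extra := time.Duration(uint64(1)<<round) * time.Second // 2^round seconds
	return 10*time.Second + extra
}

// getNextMessage sketches the behaviour described in point 1: it waits for a
// message for at most `timeout`. The second return value is false only when
// closeCh is closed (node shutdown); a plain timeout still returns true.
func getNextMessage(msgCh <-chan string, closeCh <-chan struct{}, timeout time.Duration) (string, bool) {
	select {
	case msg := <-msgCh:
		return msg, true
	case <-closeCh:
		return "", false
	case <-time.After(timeout):
		return "", true
	}
}

func main() {
	// Timeouts grow with the round number, so a node that keeps timing out is
	// cycling through rounds rather than being stuck in AcceptState.
	for round := uint64(0); round <= 3; round++ {
		fmt.Printf("round %d: timeout %s\n", round, exponentialTimeout(round))
	}

	// Simulate an AcceptState wait with no incoming proposal: the call times
	// out (ok is still true), which is what pushes the node towards
	// RoundChangeState rather than leaving it stuck.
	msgCh := make(chan string)
	closeCh := make(chan struct{})
	msg, ok := getNextMessage(msgCh, closeCh, 100*time.Millisecond)
	fmt.Printf("msg=%q ok=%v -> move to RoundChangeState\n", msg, ok)
}
```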
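Point 4 can also be checked with simple quorum arithmetic. The sketch below assumes the standard IBFT fault-tolerance bound (F = (N-1)/3 tolerated faults, 2F+1 validators needed to make progress); the repository may compute this differently.

```go
package main

import "fmt"

// Standard IBFT-style bounds (an assumption for illustration, not taken from
// the repository): N validators tolerate F = (N-1)/3 faults and need 2F+1
// participating validators to reach consensus.
func maxFaulty(n int) int { return (n - 1) / 3 }
func quorum(n int) int    { return 2*maxFaulty(n) + 1 }

func main() {
	n := 4       // the 4-node cluster from point 4
	running := 2 // after shutting down 2 nodes
	fmt.Printf("validators=%d, quorum=%d, running=%d\n", n, quorum(n), running)
	if running < quorum(n) {
		// 2 < 3: no round can gather enough messages, so every running node
		// keeps timing out and moving through round changes.
		fmt.Println("consensus cannot progress")
	}
}
```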
To summarize, I haven't found any problems in the code logic for AcceptState regarding the situation you experienced.
@brkomir thanks for looking into this. I just checked on point 3, and yes, since there was no debug logging available, it's possible they were cycling between the states.
We were 100% confident that all the nodes were up and running at the time, so we do believe there is an issue somewhere.
Then it is possible the logs were the same under the hood as in #248.
It would be great to add logging so that the state changes are visible at the INFO level. We can close this for now.
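As a closing note, here is a minimal sketch of the kind of state-transition logging suggested above, using only the standard library for illustration; the project's own logger, log levels and state representation would be used in practice.

```go
package main

import "log"

type IbftState string

const (
	AcceptState      IbftState = "AcceptState"
	ValidateState    IbftState = "ValidateState"
	RoundChangeState IbftState = "RoundChangeState"
)

// setState records every transition at INFO level so that cycling between
// states (and the current round) is visible without debug logging.
func setState(current *IbftState, next IbftState, round uint64) {
	log.Printf("[INFO] state transition %s -> %s (round %d)", *current, next, round)
	*current = next
}

func main() {
	state := AcceptState
	setState(&state, RoundChangeState, 1)
	setState(&state, AcceptState, 2)
}
```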