AcceptState not timing out to RoundChangeState

Description

When debugging our latest testnet crash (#235 and #232), we noticed a lot of the nodes got stuck in AcceptState, unable to get pushed out to RoundChangeState. So essentially we had a bunch of nodes in RoundChangeState, waiting for others to get into RoundChangeState, and some nodes in AcceptState unable to get booted to RoundChangeState.
We're not quite sure how or why this happens yet.
Unless I'm missing something, in runAcceptState, it looks like the only way this can happen is if getNextMessage continues to return a false value in the second parameter.

Your environment

Steps to reproduce

Expected behaviour

After timing out, it should get pushed to RoundChangeState according to the spec.

Actual behaviour

Nodes get stuck in AcceptState in an exponentially-increasing timeout loop.

Logs
I have investigated this issue and concluded the following:
1. The method you mentioned, getNextMessage, returns false in its second return value only when the closeCh channel is closed. In that case the node is being shut down and all code execution in the IBFT state machine stops, so this isn't relevant for this situation.
2. The exponential timeout is achieved by adding 2 to the power of the current round number to a base timeout of 10 seconds. This implies that the node isn't actually stuck in AcceptState: rounds are progressing, and the node is constantly cycling between AcceptState, ValidateState and RoundChangeState (see the first sketch below this list).
3. The misleading thing here is that no logs indicate that rounds are progressing, so this is something I will address.
4. The situation you described is a consequence of the network falling out of consensus, meaning that not enough validator nodes are functioning (more than 1/3 of the total validators are down). You can easily reproduce this by starting the 4-node cluster, shutting down 2 nodes, and observing the logs on the running nodes, which will show patterns identical to those in your description (see the quorum sketch below this list).
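To make points 1 and 2 concrete, here is a minimal, self-contained Go sketch of the pattern described above. It is not the repository's actual code: the helper exponentialTimeout and the channel wiring are illustrative assumptions; only the described behaviour (a 10-second base timeout plus 2^round seconds, and getNextMessage returning false in its second value only when closeCh is closed) is taken from the comment.

```go
package main

import (
	"fmt"
	"time"
)

// exponentialTimeout mirrors the formula described in point 2: a 10-second
// base plus 2^round seconds. (Illustrative helper, not the repository's
// implementation.)
func exponentialTimeout(round uint64) time.Duration {
	extra := time.Duration(uint64(1)<<round) * time.Second // 2^round seconds
	return 10*time.Second + extra
}

// getNextMessage sketches the behaviour described in point 1: it waits for a
// message for at most `timeout`. The second return value is false only when
// closeCh is closed (node shutdown); a plain timeout still returns true.
func getNextMessage(msgCh <-chan string, closeCh <-chan struct{}, timeout time.Duration) (string, bool) {
	select {
	case msg := <-msgCh:
		return msg, true
	case <-closeCh:
		return "", false
	case <-time.After(timeout):
		return "", true
	}
}

func main() {
	// Timeouts grow with the round number, so a node that keeps timing out is
	// cycling through rounds rather than being stuck in AcceptState.
	for round := uint64(0); round <= 3; round++ {
		fmt.Printf("round %d: timeout %s\n", round, exponentialTimeout(round))
	}

	// Simulate an AcceptState wait with no incoming proposal: the call times
	// out (ok is still true), which is what pushes the node towards
	// RoundChangeState rather than leaving it stuck.
	msgCh := make(chan string)
	closeCh := make(chan struct{})
	msg, ok := getNextMessage(msgCh, closeCh, 100*time.Millisecond)
	fmt.Printf("msg=%q ok=%v -> move to RoundChangeState\n", msg, ok)
}
```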
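Point 4 can also be checked with simple quorum arithmetic. The sketch below assumes the standard IBFT fault-tolerance bound (F = (N-1)/3 tolerated faults, 2F+1 validators needed to make progress); the repository may compute this differently.

```go
package main

import "fmt"

// Standard IBFT-style bounds (an assumption for illustration, not taken from
// the repository): N validators tolerate F = (N-1)/3 faults and need 2F+1
// participating validators to reach consensus.
func maxFaulty(n int) int { return (n - 1) / 3 }
func quorum(n int) int    { return 2*maxFaulty(n) + 1 }

func main() {
	n := 4       // the 4-node cluster from point 4
	running := 2 // after shutting down 2 nodes
	fmt.Printf("validators=%d, quorum=%d, running=%d\n", n, quorum(n), running)
	if running < quorum(n) {
		// 2 < 3: no round can gather enough messages, so every running node
		// keeps timing out and moving through round changes.
		fmt.Println("consensus cannot progress")
	}
}
```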
To summarize, I haven't found any problems in the code logic for AcceptState regarding the situation you experienced.
@brkomir thanks for looking into this. I just checked on point 3, and yes, since there was no debug logging available, it's possible they were cycling between the states.
We were 100% confident that all the nodes were up and running at the time, so we do believe there is an issue somewhere.
Then it is possible the logs were the same under the hood as in #248.
It would be great to add logging so that the state changes are visible at the INFO level. We can close this for now.
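As a closing note, here is a minimal sketch of the kind of state-transition logging suggested above, using only the standard library for illustration; the project's own logger, log levels and state representation would be used in practice.

```go
package main

import "log"

type IbftState string

const (
	AcceptState      IbftState = "AcceptState"
	ValidateState    IbftState = "ValidateState"
	RoundChangeState IbftState = "RoundChangeState"
)

// setState records every transition at INFO level so that cycling between
// states (and the current round) is visible without debug logging.
func setState(current *IbftState, next IbftState, round uint64) {
	log.Printf("[INFO] state transition %s -> %s (round %d)", *current, next, round)
	*current = next
}

func main() {
	state := AcceptState
	setState(&state, RoundChangeState, 1)
	setState(&state, AcceptState, 2)
}
```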