-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consensus failure after validator is slashed #1197
Comments
After an extensive debugging session, we were able to track the problem to a bug in the staking data model. We are making some of the debugging tooling available in a new Note the address here:
Corresponds to SummaryWe observed the following sequence of events in the blockchain:
At this point,
At this point, we loop through that set of secondary records and for each one attempt to load the primary record. We include a sanity check to ensure that the primary record exists. But since the primary record got deleted while the secondary one did not, the sanity check failed and we paniced with The reason the secondary record did not get deleted is as follows:
Since the computed power changes, when At height 60522, when the other validator unbonds and we loop through the secondary records, we find this stale record fails the sanity check, so we panic. Please note we are taking an explicit "FAIL CLOSED" approach to the software - we include liberal sanity checks in the code to ensure certain invariants aren't being violated. This ensures that we discover bugs sooner than later and halt the blockchain to fix them, rather than the chain continuing to run and be potentially exploited. ConclusionIndependent of this bug, the staking specification was updated to not have inflationary coins automatically bonded. This would remove the relevance of the exchange rate and actually prevent this kind of bug. We will make an immediate release that has some minor fixes and also removed bonded inflation for the time being, in order to allow a new testnet to restart. Note this release will be breaking and will require a new chain-id and restarting from the genesis block. In the near-term, we will release a new version of the staking module that reflects the new design. Additionally, we are working on randomized testing to catch more bugs sooner. We are also beginning a careful code review in an effort to improve the structure of the code. |
Thanks @ebuchman for the root cause analysis, that was very helpful. Will this be linked to in release notes? If not I think this should be better documented by having a dedicated place for it, release notes would be a good place for that. |
It's already been added to the STATUS.md: https://github.com/cosmos/cosmos-sdk/blob/master/cmd/gaia/testnets/STATUS.md#june-13-2018-230-est---published-postmortem-of-gaia-6001-failure And yes, we will also link to it from the new release notes. Think that's all sufficient? Or should we copy it for instance into a markdown file in the testnets/gaia-6001 folder ? |
I think just linking to this somewhere intuitive or that people would normally check between releases (e.g., release notes) would allow validators to understand what got into a new release and what was wrong with a previous release or why something was removed/added. This is quite sufficient, at least for me, thanks. |
Fixed in v0.19.0 on master. See https://github.com/cosmos/cosmos-sdk/tree/master/cmd/gaia/testnets for connecting to the new gaia-6002. Thanks all! |
My validator (Add: 295C0821D6D2EC71772E86773CD7F46F072CB764) is supposed to got slashed, but somehow it still send out the pre-vote messages on the same height, then the network has a consensus failure.
consensus state:
related error message:
The text was updated successfully, but these errors were encountered: