Consider panic one epoch before protocol upgrade if unsupported #12056

Open
Longarithm opened this issue Sep 5, 2024 · 0 comments

Context

Currently, if a node misses a protocol upgrade announcement, it panics on the first block whose protocol version is higher than the one it supports: https://github.com/near/nearcore/blob/master/chain/chain/src/chain.rs#L2139-L2143

To recover, only a binary upgrade is required. This is because, in the usual case, only the first block/chunk of the NEW protocol version could have caused a non-upgraded node to get stuck. The next chunk after that would have a different previous state root, which would diverge from the state root in ChunkExtra. To avoid saving an invalid state transition, we panic proactively.

Problem

The 70 -> 71 upgrade is one specific example where epoch info generation changes. In that case, there is an account state mismatch at the start of the epoch immediately preceding the first epoch with the new protocol version. We don't panic one epoch in advance because, as Bowen mentioned, usually we are still able to produce/process this epoch correctly and it doesn't make sense to miss rewards for it. So nodes enter an invalid state and get stuck.

Fix idea 1

For future upgrades, panic one epoch in advance. More concretely: here https://github.com/near/nearcore/blob/master/chain/epoch-manager/src/lib.rs#L734, if next_next_epoch_version > PROTOCOL_VERSION, we add a panic with the same message as at the link above.

In that case, it seems the validator would only need to upgrade the binary - no new snapshot would be required. It's not clear whether it is worth the effort.
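To make the proposed condition concrete, here is a minimal sketch of the `next_next_epoch_version > PROTOCOL_VERSION` check. It is not nearcore code: the `UpgradeCheck` enum, both function names, the value of `PROTOCOL_VERSION`, and the panic text are assumptions made for illustration. The real change would reuse the message from the existing panic in chain.rs.

```rust
/// Protocol version, modelled as a plain integer for this sketch.
type ProtocolVersion = u32;

/// Highest protocol version this hypothetical binary supports (assumed value).
const PROTOCOL_VERSION: ProtocolVersion = 70;

/// Result of comparing an upcoming epoch's version with what the binary supports.
#[derive(Debug, PartialEq, Eq)]
enum UpgradeCheck {
    Supported,
    Unsupported { required: ProtocolVersion, supported: ProtocolVersion },
}

/// Pure check, kept separate from the panic so it can be tested in isolation.
fn check_next_next_epoch_version(
    next_next_epoch_version: ProtocolVersion,
    supported: ProtocolVersion,
) -> UpgradeCheck {
    if next_next_epoch_version > supported {
        UpgradeCheck::Unsupported { required: next_next_epoch_version, supported }
    } else {
        UpgradeCheck::Supported
    }
}

/// Panics one epoch early if the epoch after next needs a newer binary.
/// The message below is a placeholder, not the actual nearcore message.
fn panic_if_unsupported(next_next_epoch_version: ProtocolVersion) {
    if let UpgradeCheck::Unsupported { required, supported } =
        check_next_next_epoch_version(next_next_epoch_version, PROTOCOL_VERSION)
    {
        panic!(
            "protocol version {} is required but this binary supports only {}",
            required, supported
        );
    }
}
```

Keeping the comparison in a function that returns a value, and panicking only in a thin wrapper, would also make it easy to downgrade the panic to a warning if missing rewards turns out to be too costly.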

Drawback

If the validator was able to process that one epoch, it will miss the rewards for it because of the panic.

Fix idea 2

Find out whether it is necessary to lock an account's stake one epoch in advance. If it isn't, we could avoid the invalid state transition one epoch in advance; it would happen only when the epoch with the new protocol version appears, which is natural to expect and is already handled.

Fix idea 3

If a node starts to observe higher protocol versions than it supports - in block infos, I guess - it should start actively displaying warnings that the protocol may be about to upgrade.
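A rough sketch of what such a warning could look like is below. Everything in it is hypothetical: `UpgradeWatcher`, its fields, and the warning text are made up for illustration, and it assumes the protocol version advertised by each observed block is available as an integer. It returns the warning as a string instead of logging it, so the caller decides how to display it.

```rust
/// Protocol version, modelled as a plain integer for this sketch.
type ProtocolVersion = u32;

/// Tracks the highest protocol version seen in observed blocks (hypothetical type).
struct UpgradeWatcher {
    /// Highest version the running binary supports.
    supported: ProtocolVersion,
    /// Highest version observed so far; starts at `supported`.
    highest_seen: ProtocolVersion,
}

impl UpgradeWatcher {
    fn new(supported: ProtocolVersion) -> Self {
        Self { supported, highest_seen: supported }
    }

    /// Returns a warning the first time each new unsupported version is observed,
    /// and `None` for supported versions or versions already warned about.
    fn observe(&mut self, block_version: ProtocolVersion) -> Option<String> {
        if block_version > self.highest_seen {
            self.highest_seen = block_version;
            Some(format!(
                "observed protocol version {} but this binary supports up to {}; \
                 the protocol may be about to upgrade",
                block_version, self.supported
            ))
        } else {
            None
        }
    }
}
```

This version warns once per newly observed version to avoid flooding the log. To warn "actively" as suggested, a node could instead repeat the message on a timer while `highest_seen` is greater than `supported`.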

Fix idea 4

If a node does not know the voted protocol version, it takes a snapshot before creating the new epoch and discards it once it is upgraded to the latest binary. This way, when the node reaches the new protocol version and stalls, the operator can upgrade the binary and, if that does not fix it, revert to the snapshot.

If the operator fails to upgrade in time, the snapshot will increase storage use somewhat, but someone also pays for all the AWS traffic.
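The snapshot lifecycle described here can be sketched as a small decision function. This is only an illustration under assumptions: `SnapshotAction` and `snapshot_action` are hypothetical names, "does not know the voted version" is modelled as the voted version being higher than the supported one, and actually taking or deleting a snapshot is left out.

```rust
/// Protocol version, modelled as a plain integer for this sketch.
type ProtocolVersion = u32;

/// What the node should do with its pre-upgrade snapshot (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum SnapshotAction {
    /// Binary does not support the voted version and no snapshot exists yet.
    Take,
    /// Binary is still outdated; keep the snapshot for a possible revert.
    Keep,
    /// Binary now supports the voted version; the snapshot is no longer needed.
    Discard,
    /// Binary is up to date and there is no snapshot.
    Nothing,
}

/// Decides the snapshot action before a new epoch is created.
fn snapshot_action(
    voted_version: ProtocolVersion,
    supported_version: ProtocolVersion,
    has_snapshot: bool,
) -> SnapshotAction {
    let outdated = voted_version > supported_version;
    match (outdated, has_snapshot) {
        (true, false) => SnapshotAction::Take,
        (true, true) => SnapshotAction::Keep,
        (false, true) => SnapshotAction::Discard,
        (false, false) => SnapshotAction::Nothing,
    }
}
```

With a voted version of 71, a binary supporting 70 would take a snapshot and keep it, and the same node restarted on a binary supporting 71 would discard it.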

Full thread https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/incorrectly.20applied.20proposal/near/467585317
