Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Staking: Chilled validators should wait a fixed period before they can validate again. #420

Open
Ank4n opened this issue Mar 26, 2023 · 4 comments

Comments

@Ank4n
Copy link
Contributor

Ank4n commented Mar 26, 2023

Context

When a validator misbehaves (such as missing their block authoring slot), they can be slashed and chilled. Being chilled means, their validator intention is withdrawn. We made a decision to preserve the approval stakes (record of nominators backing this validator) of the validator as this misbehaviour could sometimes be a result of a protocol bug. While slashes can be cancelled, approval stakes can not be restored easily. And losing approval stakes (potentially from a bug and not their misbehaviour) would mean losing all the nominators supporting them when they resume validation.

Problem

A validator after being chilled can put forward their validator intention again before the end of the next era effectively continuing being part of the active set. A nominator who put their stake behind this validator is continuously exposed to this nominator.

Proposed Solution

A validator once they are chilled, has to wait X eras (should be configurable) before they can validate again. This would mean a validator is kicked out of the active set for at least X number of eras and would give Nominators some more time to react to the slashing event and change their nominations if they so wish.

@burdges
Copy link

burdges commented Mar 26, 2023

We do not have any punishment based reasons for chilling validators. We want to know the financial costs of a punishment, but it's harder to compute this cost than the cost of removing era points or currency.

We chill validators largely because their spam might create liveness issues. In principle, if an honest & competent node operator believe they fixed the problem then we should let them back in. If they're simply wrong then we can always chill them again. If they're not honest or not competent, like if they auto-unchill themselves, then we need slashing to do something more serious.

We also chilled validators to protect their nominators, so we removed their nominators. We've found this extremely annoying in practice, so we've stopped removing nominators in some cases, but it's obviously be nice if nominators do not really get slashed too badly just because of an impatient validator operator.

It's possible a waiting period provides a useful tool in achieving a compromise here, like maybe 24 hours, so that people anywhere in the world have a chance to be awake to react. It depends somewhat upon what else is possible though, as this was never meant as a punishment.

@kianenigma
Copy link
Contributor

The context you explained is basically paritytech/substrate#6835, which we closed via paritytech/substrate#12420

A validator once they are chilled, has to wait X eras (should be configurable) before they can validate again. This would mean a validator is kicked out of the active set for at least X number of eras and would give Nominators some more time to react to the slashing event and change their nominations if they so wish.

I think this is now a direct consequence of paritytech/substrate#12420, and I am not super keen on creating too many complicated systems that hand-hold nominators in such events. Nominators are paid to actively monitor the system and react to slashes, just as validators do. I would put my effort into building the tools needed for this, not protecting lazy nominators.

@dcolley
Copy link
Contributor

dcolley commented Mar 27, 2023

An operator may need to self-chill a validator from time to time. Does that mean they can't re-join immediately?

I had a validaor that got chilled in the last 24 hours when 0.9.40 release started causing havoc (polkadot paritytech/substrate#6956). After downgrading to 0.9.39 (30 mins to recompile) and restarting I was able to rejoin the active set with a validator functioning correctly.

@Ank4n in your proposed solution you mix slash and chill. Which one is the problem that needs to be punished?

For slashes you can request compensation if you can prove there was no malintent, but there is no method for getting back nominations... getting enough nominators can take months.

In the cases where a slash and chill happens at the same time, it would be helpful to have a chill-reason in the chillOther extrinsic. Or perhaps we can infer the chill-reason when the chill and slash operations can be found in the same batch?

Are there any stats on how often this is occuring in the wild. Without knowing the scale of the issue, and what problem this causes on the chain, the proposal seems unecessarily harsh on honest operators (esp. when chill happens due to protocol issues).

[Honest] Validator operators live and die by their reputation, so I would like to know how many nominators are being affected by malicious actions from validators?

@Ank4n
Copy link
Contributor Author

Ank4n commented Mar 27, 2023

An operator may need to self-chill a validator from time to time. Does that mean they can't re-join immediately?

I had a validaor that got chilled in the last 24 hours when 0.9.40 release started causing havoc (polkadot paritytech/substrate#6956). After downgrading to 0.9.39 (30 mins to recompile) and restarting I was able to rejoin the active set with a validator functioning correctly.

@Ank4n in your proposed solution you mix slash and chill. Which one is the problem that needs to be punished?

For slashes you can request compensation if you can prove there was no malintent, but there is no method for getting back nominations... getting enough nominators can take months.

In the cases where a slash and chill happens at the same time, it would be helpful to have a chill-reason in the chillOther extrinsic. Or perhaps we can infer the chill-reason when the chill and slash operations can be found in the same batch?

Are there any stats on how often this is occuring in the wild. Without knowing the scale of the issue, and what problem this causes on the chain, the proposal seems unecessarily harsh on honest operators (esp. when chill happens due to protocol issues).

[Honest] Validator operators live and die by their reputation, so I would like to know how many nominators are being affected by malicious actions from validators?

Thanks for your input @dcolley.

The proposal was only for validators who got chilled by the protocol as a consequence of a misbehaviour. Self chill would not be affected. Also, the nominations are restored when validator is able to validate again, as is the case today.

But as you also have mentioned and after talking to few people, this proposal does not add much value except it protects the nominator to some degree by giving them some more time to adjust their exposures. The good way to solve this problem though is to build better tooling for nominators to monitor their nominations. The validators are already punished by slashing and this additional proposed punishment is not required.

The issue is marked to be revisited someday if we notice some future on-chain behaviour that justifies this, but for now we have decided to shelve this proposal.

@juangirini juangirini transferred this issue from paritytech/substrate Aug 24, 2023
bkchr pushed a commit that referenced this issue Apr 10, 2024
* fixed clippy warnings

* Revert "Actually use pinned nightly version when building runtimes (#465)"

This reverts commit dedddb6.

* Revert "Pin Rust Nightly Version (#420)"

This reverts commit 8902ac2.

* fix after revert

* another fix after revert

* more clippy fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 📕 Backlog
Development

No branches or pull requests

5 participants