This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Candidate validation timeouts (t_good, t_bad, and t_ugly) #1656

Closed
wants to merge 2 commits

Conversation

burdges
Contributor

@burdges burdges commented Aug 28, 2020

We only use wasmtime here, not wasmi. We never escalate and slash unless some approval checker times out (or reports invalid). Also, you never get slashed for a timeout unless you're a factor of alpha slower or faster than 2/3rds of validators (non-timeout invalidity reports aside).

We did not address non-timeout invalidity reports well enough here, did we @AlfonsoCev?

@AlistairStewart should review

We meant to open this from @AlfonsoCev's machine earlier, but Windows baffles me, so maybe we botched it. Replaces #1628

Candidate validation timeouts for wasmtime
@burdges burdges changed the title Candidate validation timesouts Candidate validation timesouts (t_good, t_bad, and t_ugly) Aug 28, 2020
@burdges burdges changed the title Candidate validation timesouts (t_good, t_bad, and t_ugly) Candidate validation timeouts (t_good, t_bad, and t_ugly) Aug 28, 2020
@AlfonsoCev

Yeah, invalidity reports are not discussed yet. If the general approach seems to make sense, I can add a proposal to the file for dealing with invalidity reports.

some changes to handle edge cases and deal with (non-timeout) invalidity reports
@AlfonsoCev

I updated the note. Unfortunately I removed the t_good, t_bad, t_ugly notation and kept only two time thresholds (t_back, t_check). The new version handles edge cases better (as was pointed out by Al). It also explains how to deal with (non-timeout) invalidity reports, which is actually exactly the same way as timeout reports.

@burdges
Contributor Author

burdges commented Sep 3, 2020

I'm happy with these new changes, but cannot approve since I'm the PR submitter.

We should perhaps discuss updating r_v, but I'm not sure who should do so. Issues:

  • Can runtime and/or r_v exclude specific native functionality with deterministic runtimes, like (batch) verifying signatures, proofs, etc.?
  • Does r_v get marked untrusted sometimes? On wasmtime upgrades? On PVF upgrades?
  • Does r_v depend upon the parachain? If yes, it updates slowly. If no, it may represent parachains less well. Should r_v really be multiple values, consisting of PVF and wasmtime version specific values? Or could PVFs be tied to specific wasmtime versions?
  • Do we ever monitor r_v for evidence of validator changing it incorrectly? Could this indicate any malicious behavior? Or should they be free to play with r_v?


When a validator $v$ executes a parablock $B$, she reports her execution time $t_v(B)$ in milliseconds along with her current estimated ratio $r_v$. From the reported pair $(t_v(B), r_v)$ we learn that $v$'s estimate of the *average execution time* $t_{average}(B)$ of this parablock is $t_v(B)/r_v$. A validator should accept parablock $B$ only if her estimate of $t_{average}(B)$ is less than a certain threshold; this threshold is lower for backing validators and higher for checking validators, to minimize the possibility of conflict (i.e. a checker that reports "timeout" on a backed block). In case of conflict, we escalate and ask all validators to compute their own estimates of $t_{average}(B)$. We slash validators whose estimates are outliers in the corresponding distribution, for failing to estimate $t_{average}(B)$ accurately.

There are two constants $\alpha$ and $\beta$ (say, $\alpha=2$ and $\beta=3$), and two time limits $t_{back}$ and $t_{check}=\alpha^2\beta \cdot t_{back}$, expressed in milliseconds. Parameters $\alpha$, $\beta$ and $t_{back}$ are decided by governance. The intuition behind constants $\alpha$ and $\beta$ is that ...
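
A minimal sketch of the acceptance rule above, in Rust. The names, the `Role` split, and any concrete values of alpha, beta, and t_back are illustrative assumptions rather than anything fixed by the note; only the relations $t_{average}=t_v(B)/r_v$ and $t_{check}=\alpha^2\beta \cdot t_{back}$ come from the quoted text.

```rust
/// Illustrative parameters; the note only says alpha, beta and t_back are
/// decided by governance, with alpha = 2 and beta = 3 as example values.
struct TimeoutParams {
    alpha: u64,
    beta: u64,
    t_back_ms: u64,
}

impl TimeoutParams {
    /// t_check = alpha^2 * beta * t_back, per the formula quoted above.
    fn t_check_ms(&self) -> u64 {
        self.alpha * self.alpha * self.beta * self.t_back_ms
    }
}

enum Role {
    Backer,
    Checker,
}

/// A validator accepts parablock B only if her estimate of the average
/// execution time, t_v(B) / r_v, is below the threshold for her role.
fn accepts(t_v_ms: f64, r_v: f64, role: Role, params: &TimeoutParams) -> bool {
    let t_average_estimate = t_v_ms / r_v;
    let threshold_ms = match role {
        Role::Backer => params.t_back_ms,
        Role::Checker => params.t_check_ms(),
    } as f64;
    t_average_estimate < threshold_ms
}
```

The ratio t_check / t_back = alpha^2 * beta is then what separates a backer's acceptance threshold from a checker's timeout, which is what keeps honest backers and checkers from disputing each other.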
Contributor Author

Why beta=3? I'd think beta=2 works fine, no?

I'm actually much more worried about alpha=2; alpha>2 sounds really likely. We could increase alpha and decrease beta temporarily during a wasmtime upgrade, perhaps?

Contributor Author

@burdges burdges Sep 3, 2020


We could phrase this in terms of assumptions, I guess: we set alpha=2 because we believe the r_v-adjusted runtime should never be off by more than a factor of 2. That's fairly strong.

Contributor Author


We discussed this: We've no solid justification for alpha or beta right now, but alpha must be set by our variance estimation for machines.

@burdges
Contributor Author

burdges commented Sep 3, 2020

Any thoughts @pepyakin ?

@burdges
Contributor Author

burdges commented Sep 6, 2020

We're happy with roughly the r_v updating procedure specified here. We could implement r_v off-chain first, but use arithmetic compatible with on-chain operation, and move to tracking it on-chain later. It cannot be truly on-chain because you must declare your r_v when you register a new session key.

We need not track any other statistics on-chain or off-chain in nodes, but some additional Polkadot tooling should collect other statistics for this stuff, like variance for parachains, etc.

Can you add a brief note about using on-chain compatible arithmetic, @AlfonsoCev?
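
A rough sketch of what "on-chain compatible arithmetic" for r_v could look like, assuming r_v is stored as a parts-per-million integer and updated with an exponential moving average; both the scaling and the update rule are assumptions for illustration, not the procedure from the note.

```rust
/// r_v held as a fixed-point integer scaled by PPM, so off-chain nodes and a
/// future on-chain implementation compute bit-identical results (no floats).
const PPM: u128 = 1_000_000;

/// Estimate t_average(B) = t_v(B) / r_v in milliseconds, given the measured
/// time and r_v scaled by PPM.
fn estimated_average_ms(t_v_ms: u128, r_v_ppm: u128) -> u128 {
    t_v_ms.saturating_mul(PPM) / r_v_ppm.max(1)
}

/// One exponential-moving-average style update of r_v after observing a new
/// execution; `weight_ppm` (<= PPM) is a smoothing factor chosen for this
/// sketch, not something the note specifies.
fn update_r_v(old_r_v_ppm: u128, observed_ratio_ppm: u128, weight_ppm: u128) -> u128 {
    (old_r_v_ppm * (PPM - weight_ppm) + observed_ratio_ppm * weight_ppm) / PPM
}
```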

@burdges
Contributor Author

burdges commented Sep 9, 2020

We might add some comment that doing "batch verification" for anything heavy, like signatures, snarks, Merkle proofs, etc., gives a nice deterministic run time, so it can be launched in another thread and simply trusted to terminate. We could either add this time or simply account for it separately based upon the size of the batch calls.

Update: Added a security issue about this since we'll punt on this part here
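
One possible reading of "account for it separately based upon the size of the batch calls", as a hedged sketch: credit a deterministic per-item cost for batched crypto against the measured time before the timeout comparison. The constant and the function name are hypothetical.

```rust
/// Hypothetical per-signature allowance for batch verification, in
/// microseconds; a placeholder, not a measured or agreed figure.
const BATCH_SIG_COST_US: u64 = 50;

/// Deduct the deterministic batch-verification budget from the measured
/// execution time, so heavy but predictable crypto work does not push an
/// otherwise honest execution over the timeout.
fn adjusted_execution_us(measured_us: u64, batched_signatures: u64) -> u64 {
    measured_us.saturating_sub(batched_signatures.saturating_mul(BATCH_SIG_COST_US))
}
```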

@burdges
Contributor Author

burdges commented Sep 14, 2020

We're basically done for this initial stage. Alfonso may never have pushed anything about doing on-chain compatible arithmetic, though.

We could mention that if the node runs the parachain code natively then it should maybe rerun it in wasm under suspicious circumstances, but maybe this requires more design work. We might even specify interface guidelines for native code invoked from wasm, like requiring good, somewhat deterministic timing estimates.

We should eventually discuss the timing metrics in greater technical detail: Is there more than one CPU clock we could use? etc.

@apopiak
Contributor

apopiak commented Jan 15, 2021

Could you give some intuition on how this interacts with runtime upgrades/storage migrations?
It can easily happen that an upgrade block takes much longer than a regular one, so in order to enable that the timeout would have to be quite long, which in turn would expose validators to problems, no?

@burdges
Contributor Author

burdges commented Jan 15, 2021

We're only discussing WASM execution time here, which only begins after we successfully download the block. Approval checkers learn their assignments from the VRF of the relay chain block that declares the candidate available, so our 2/3rds honest assumption ensures reconstruction succeeds.

If reconstruction takes too long then we'll repeatedly announce this fact, but we'll eventually become no-shows and be replaced. If this happens to too many people, then we'll behave as if under a massive DoS attack and escalate to all checkers, which likely grinds finality to a halt, and the relay chain should cut its block production rate. This escalates slowly until some checker ceiling is reached, at which point we declare the block invalid, rewind the relay chain, and continue without that block. We eventually slash if the block turns out invalid, of course.

@apopiak
Contributor

apopiak commented Jan 15, 2021

Seems like an unwanted consequence of a runtime upgrade of one parachain, though, to grind the whole thing to a halt.
As discussed elsewhere we'll probably want to investigate multi-block migrations in order to avoid this whenever possible.

See paritytech/substrate#7911

@burdges
Contributor Author

burdges commented Jan 15, 2021

We currently put code onto the relay chain itself, I think, so actually it'll halt everything anyway. We've fucked up if we permit such large blocks in the first place. We do know one other partial solution:

We distinguish "post code blocks" which provide new code but disallow all other functionality, so they only require availability but approvals become automatic. After availability, you could "switch code" whenever you like. At this point, we make backing and approval checkers who lack the code reconstruct the code before reconstructing the block. It's plausible some backing checkers already possess the large block, but mostly backers would take too long too, and often drop your blocks, so mostly just your parachain halts.

To be clear, we're worried about far far more serious soundness problems than mere chain halts in this issue. We're just assuming ad hoc methods like limiting code size suffice for your concerns.

@burdges burdges mentioned this pull request Apr 3, 2021
@burdges
Contributor Author

burdges commented Apr 6, 2021

We should update this to discuss non-timeout faults, like the wasm runner receiving SIGKILL due to violating rlimit, etc.

We'll halt parachains whenever a supermajority judges the code flawed, meaning any invalidity condition besides the PVF itself detecting the candidate as invalid, so including both timeouts and other non-timeout system faults like violating rlimit.

If we slash, either 100% for false validity claims or the smaller slash for false invalidity claims, and the claim concerns a timeout or non-timeout system fault, then the slashed validator should demonstrate to governance the configuration that causes the false validity claim, timeout, or fault. If correct, then governance refunds the slash and halts the parachain.

At these points, the parachain must convince governance they've fixed the code before continuing. We'll let governance and the community figure out if and when parachains should ever compensate temporarily slashed nodes for lost rewards.

@cla-bot-2021

cla-bot-2021 bot commented Jun 3, 2021

User @burdges, please sign the CLA here.

@eskimor
Member

eskimor commented Dec 22, 2021

I have been pondering this, and here is what I came up with:

So, why do we slash? We slash so that attempts to get an invalid block in, for example to do some variant of a double spend, lead to gambler's ruin.

The Underlying Problem

So why do we care about timeouts - could we not just abandon a candidate if enough approval checkers say: "I ran into a timeout validating this"?

Answer: No, because this could be abused by an attacker to get free tries at getting a bad candidate in. Just craft a candidate that takes ages to execute. Approval checkers are either your own guys, who don't have to execute the candidate at all, or honest guys, who will time out, and you just successfully avoided slashing on a failed attempt.

Non Issues

Therefore timeouts must mean an invalid vote, and in the above scenario that will be fine: a majority votes invalid and the attacker gets slashed. But what if a candidate was not 100% malicious as in the previous paragraph, but just really heavy, so that some validators run into the timeout? They would report invalid. If they are in the minority they will get slashed mildly ... which should actually be fine: if the slash is mild enough, you just got punished for being slow ... better get faster hardware, better monitor your hardware and have alerts on execution times. We measure CPU time, not wall clock time, so heavy load is not an excuse.

We can have docs for operators informing them that they should monitor their execution times to avoid slashes for slow hardware.

We do have the conflicting requirement of not making those false invalid-vote slashes too mild, for spamming considerations, but that could be resolved relatively easily. (E.g. don't only disable validators once they have lost enough stake, but also once they have been on the losing side of a dispute more than x times in the last session.)

Note: If that was done on purpose, e.g. an attacker ignoring the backing timeout but providing a candidate that just passes the approval-checking timeout on his machine, he might have managed to produce mild slashes on only slightly slower validators, at the high risk of getting slashed 100% himself. If he wants to minimize that risk, then again only really slow validators will get slashed. If he is fine with getting slashed 100%, then this brings us right to the next section:

The Real Issue

The more problematic issue is the other way round: some validators being super fast! In that case those super-fast validators will be in the minority, approving/backing a candidate deemed invalid. We just slashed validators 100% for doing an excellent job!

How big is the issue though? In backing we have way stricter timeouts than in approvals, so an honest backer running into that issue must really be extraordinarily fast.

A malicious backer obviously could craft such a candidate and could cause fast honest validators to get slashed, at the cost of losing all of their own stake as well. Not a very attractive attack: you end up in a lose-lose situation, with nobody gaining anything.

Still, the backer could have been hacked, for example. This can be mitigated by requiring a backing vote threshold greater than 1, which should already provide sufficient protection: if we assume successful decentralization, it should be very unlikely for a hacker to have hijacked more than one validator.

Summary & Considerations

With mild slashes for voting invalid on valid candidates, a sufficiently large number of required backing votes, and good values for backing and approval-checking timeouts, we should get very far without adding even more complexity to an already very complex system.

While 2 bad validators is obviously far below our usual Byzantine assumption, the integrity of the system was never at stake, so I would argue that justifies the weaker requirement.

To avoid slashing fast honest validators 100%, we should make sure the ratio approval-validation-timeout / backing-validation-timeout is always large enough to make that risk negligible.

@burdges
Contributor Author

burdges commented Dec 22, 2021

If I understand correctly, you're making two statements:

  1. We could make do with a simpler timeout system, since backing voters should be way stricter than approval voters.
  2. Increasing the backing threshold reduces the odds of the fast validator slashes.

Interesting points. I kinda believe them against random events, which is mostly what this PR handles too, but they're iffy against adversarial parachains.

I'm dubious about 2 in particular because an adversarial parachain could likely identify the fastest validators from other behavior, and then simply wait until all upcoming backers are fast, so zero slashing for the attacker, who also pockets the reporting rewards.

We cannot depend solely upon timeouts here anyway, so imho we should do some work towards normalized time reporting by validators, which firstly means CPU timers, not the vague "milliseconds" written here. Indeed even the ratios may not make sense.
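
For reference, a sketch of measuring per-thread CPU time rather than wall-clock milliseconds, assuming a Linux target and the `libc` crate; the helpers are illustrative, not an existing Polkadot API.

```rust
use std::time::Duration;

/// Per-thread CPU time via clock_gettime(CLOCK_THREAD_CPUTIME_ID): it counts
/// time actually spent on this thread, so unrelated load on the machine does
/// not inflate the measurement the way a wall-clock timer would.
fn thread_cpu_time() -> Duration {
    let mut ts = libc::timespec { tv_sec: 0, tv_nsec: 0 };
    unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

/// Run a validation closure and report its CPU time alongside the result.
fn measure_cpu<R>(f: impl FnOnce() -> R) -> (R, Duration) {
    let start = thread_cpu_time();
    let out = f();
    (out, thread_cpu_time() - start)
}
```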

Are the historic execution time ratios described here useful? I suppose making the ratios cross-parachain is a risky oversimplification. Yet, we're definitely not going to be doing parachain runtime testing across all validators, so are they still useful if they're cross-parachain?

It's similarly true that if this proposal is too complex then we can tighten it around where the real damage gets done, or even tolerate more risk by being stricter elsewhere, primarily governance refunding slashes once we witness some machines running fast enough.

At the extreme, if this blows up several times then parachain development could become a much more "interactive" process, with auditors signing off on runtime upgrades or whatever. At the sci-fi extreme, we could even make validators produce a zero-knowledge proof of correct execution before being slashed, or even complicate our interactive proof (neither sounds realistic).

As an aside, we've another adversarial behavior in which abusive backers permit extra-large parachain blocks into the relay chain, which perhaps warrants forgoing their backer rewards. A priori, 2 or 3 backers makes this rare enough that maybe it could be ignored, except it risks becoming a regular abuse.

@burdges
Contributor Author

burdges commented Dec 22, 2021

Anyway, I'm dubious about the ratios here for sure, but at least superficially the alpha and beta part still looks sensible because it gives assurances. We want backer / approval times to differ by less than a factor of 6 for some reason?

@eskimor
Member

eskimor commented Dec 22, 2021

Anyway, I'm dubious about the ratios here for sure, but at least superficially the alpha and beta part still looks sensible because it gives assurances. We want backer / approval times to differ by less than a factor of 6 for some reason?

Not that I know of. I guess we are a bit time-sensitive in approval checking, with no-shows and such.

For the large parachain block issue: I don't quite follow. Large parachain blocks only play a role in backing and availability recovery/approvals. And approval checkers will reject the candidate if it exceeds the size limit.

For the malicious parachain case: If we are free to make approval validation and backing validation timeouts differ enough to make it practically impossible that a fast validator validates within the backing timeout while the majority can't, even within the longer approval timeout, then this should also not be an issue. The question is: what are our practical limits for allowed timeout ratios? What limiting factors do we have?

A very short backing validation timeout has obvious issues. A very long approval validation timeout would lead to noshows and thus needless work, harming performance and delaying finality. 🤔

@burdges
Contributor Author

burdges commented Dec 23, 2021

In this PR, alpha=2 and beta=3 yield approval check timeouts 6 times as long as backing timeouts, which sounds fine I guess? I think no-show delays are like 12 or 24, not sure anymore.

We're anyway unsure whether nodes rebroadcasting their expected timeout estimates helps. It does not enter into the picture here, unless I forgot something.

@burdges burdges mentioned this pull request Sep 1, 2022
@bkchr
Member

bkchr commented May 8, 2023

Stale.
