Candidate validation timeouts (t_good, t_bad, and t_ugly) #1656
Conversation
Candidate validation timeouts for wasmtime
Yeah, invalidity reports are not discussed yet. If the general approach seems to make sense, I can add a proposal to the file for dealing with invalidity reports.
Some changes to deal with edge cases and to handle (non-timeout) invalidity reports.
I updated the note. Unfortunately I removed the t_good, t_bad, t_ugly notation and kept only two time thresholds (t_back, t_check). The new version handles edge cases better (as was pointed out by Al). It also explains how to deal with (non-timeout) invalidity reports, which is actually exactly the same way as timeout reports.
I'm happy with these new changes, but cannot approve since I'm the PR submitter. We should discuss updating r_v perhaps, but I'm not sure who should do so. Issues:
When a validator $v$ executes a parablock $B$, she reports her execution time $t_v(B)$ in milliseconds along with her current estimated ratio $r_v$. From the reported pair $(t_v(B), r_v)$ we learn that $v$'s estimate of the *average execution time* $t_{average}(B)$ of this parablock is $t_v(B)/r_v$. A validator should accept parablock $B$ only if her estimate of $t_{average}(B)$ is less than a certain threshold, and this threshold is lower for backing validators and higher for checking validators, to minimize the possibility of conflict (i.e. a checker that reports "timeout" on a backed block). In case of conflict, we escalate and ask all validators to compute their own estimates of $t_{average}(B)$. We slash validators whose estimates are outliers in the corresponding distribution, for failing to estimate $t_{average}(B)$ accurately.
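A minimal sketch of this acceptance rule, assuming a hypothetical parts-per-thousand encoding of r_v and integer millisecond timings (none of these names come from the guide itself):

```rust
/// Validator v's speed ratio r_v, stored as parts-per-thousand so only
/// integer arithmetic is needed; 1_500 would mean "v typically takes 1.5x
/// the average execution time". (Hypothetical encoding, for illustration.)
struct SpeedRatio(u64);

/// v's estimate of the average execution time of block B, in milliseconds:
/// t_average(B) ~= t_v(B) / r_v.
fn estimated_average_ms(t_v_ms: u64, r_v: &SpeedRatio) -> u64 {
    t_v_ms.saturating_mul(1_000) / r_v.0.max(1)
}

/// Accept the parablock only if the estimated average execution time is
/// below the role-specific threshold (lower for backers, higher for checkers).
fn accept(t_v_ms: u64, r_v: &SpeedRatio, threshold_ms: u64) -> bool {
    estimated_average_ms(t_v_ms, r_v) < threshold_ms
}
```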
There are two constants $\alpha$ and $\beta$ (say, $\alpha=2$ and $\beta=3$), and two time limits $t_{back}$ and $t_{check}=\alpha^2\beta \cdot t_{back}$, expressed in milliseconds. Parameters $\alpha$, $\beta$ and $t_{back}$ are decided by governance. The intuition behind constants $\alpha$ and $\beta$ is that …
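For concreteness, a sketch of how the two limits relate under the quoted formula, with an assumed illustrative t_back (the real value is decided by governance):

```rust
/// t_check = alpha^2 * beta * t_back, as in the quoted text.
fn t_check_ms(t_back_ms: u64, alpha: u64, beta: u64) -> u64 {
    alpha * alpha * beta * t_back_ms
}

fn main() {
    // Illustrative numbers only: alpha = 2, beta = 3, t_back = 2_000 ms
    // give t_check = 24_000 ms.
    assert_eq!(t_check_ms(2_000, 2, 3), 24_000);
}
```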
Why beta=3? I'd think beta=2 works fine, no?
I'm actually much more worried about alpha=2; alpha>2 sounds really likely. We could increase alpha and decrease beta temporarily during a wasmtime upgrade, perhaps?
We could phrase this in terms of assumptions, I guess: we set alpha=2 because we believe the r_v-adjusted runtime should never be off by more than a factor of 2. That's fairly strong.
We discussed this: We've no solid justification for alpha or beta right now, but alpha must be set by our variance estimation for machines.
Any thoughts @pepyakin?
We're happy with roughly the r_v updating procedure specified here. We could implement r_v off-chain first, but use arithmetic compatible with on-chain operation, and move to tracking it on-chain later. It cannot be truly on-chain because you must declare your r_v when you register a new session key. We need not track any other statistics on-chain or off-chain in nodes, but some additional polkadot tooling should collect other statistics for this stuff, like variance for parachains, etc. Can you add a brief note about using on-chain compatible arithmetic, @AlfonsoCev?
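A sketch of what "on-chain compatible arithmetic" for the r_v update could look like: an exponential moving average done purely in integers, so every node (and later the chain) computes bit-identical values. The encoding and the smoothing factor are assumptions, not anything specified in the PR:

```rust
/// r_v stored as parts-per-thousand (1_000 == 1.0). Hypothetical encoding.
const SCALE: u64 = 1_000;

/// Update r_v after executing a block, given v's measured time and the
/// (consensus-derived) average time for that block. Integer-only EMA with
/// weight 1/16, so the result is deterministic across architectures.
fn update_r_v(old_r_v: u64, t_v_ms: u64, t_average_ms: u64) -> u64 {
    let observed = t_v_ms.saturating_mul(SCALE) / t_average_ms.max(1);
    (old_r_v.saturating_mul(15) + observed) / 16
}
```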
We might add a comment that doing "batch verification" for anything heavy (signatures, snarks, Merkle proofs, etc.) gives a nice deterministic runtime, which can be launched in another thread and simply trusted to terminate. We could either add this time or simply account for it separately based upon the size of the batch calls. Update: Added a security issue about this since we'll punt on this part here.
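A sketch of that batch-verification idea, under the assumption that batch runtime is roughly linear in the number of items; the types and per-item cost are placeholders:

```rust
use std::thread;

/// Opaque items whose verification is heavy but size-deterministic
/// (signatures, snarks, Merkle proofs). Placeholder type.
struct Batch {
    items: Vec<Vec<u8>>,
}

impl Batch {
    fn verify_all(&self) -> bool {
        // Runtime scales with items.len(), so the thread can simply be
        // trusted to terminate.
        self.items.iter().all(|_item| true /* verify the item here */)
    }
}

/// Account for the batch separately from the measured wasm execution time,
/// based only on its size.
fn batch_time_budget_ms(batch_len: usize, per_item_ms: u64) -> u64 {
    batch_len as u64 * per_item_ms
}

/// Launch verification in another thread while wasm execution continues.
fn spawn_batch_verification(batch: Batch) -> thread::JoinHandle<bool> {
    thread::spawn(move || batch.verify_all())
}
```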
We're basically done for this initial stage. Alfonso never pushed anything about doing on-chain compatible arithmetic though, maybe. We could mention that if the node runs the parachain code in native then it should maybe rerun in wasm under suspicious situations, but maybe this requires more design work. We might even specify interface guidelines for native code invoked from wasm, like requiring good, somewhat deterministic timing estimates. We should eventually discuss the timing metrics in greater technical detail: is there more than one CPU clock we could use? etc.
Could you give some intuition on how this interacts with runtime upgrades/storage migrations?
We're only discussing WASM execution time here, which only begins after we successfully download the block. Approval checkers learn their assignments from the VRF of the relay chain block that declares the candidate available, so our 2/3rds honest assumption then ensures reconstruction succeeds. If reconstruction takes too long then we'll repeatedly announce this fact, but we'll eventually become a no-show and be replaced. If this happens to too many people, then we'll behave as if under a massive DoS attack, and …
Seems like an unwanted consequence of a runtime upgrade of one parachain, though, to grind the whole thing to a halt.
We currently put code onto the relay chain even now, I think, so actually it'll halt everything anyways. We've fucked up if we permit such large blocks anyways. We do know one other partial solution: we distinguish "post code blocks" which provide new code but disallow all other functionality, so they only require availability but approvals become automatic. After availability, you could "switch code" whenever you like. At this point, we make backing and approval checkers who lack the code reconstruct the code before reconstructing the block. It's plausible some backing checkers already possess the large block, but mostly backers would take too long too, and often drop your blocks, so mostly just your parachain halts. To be clear, we're worried about far far more serious soundness problems than mere chain halts in this issue. We're just assuming ad hoc methods like limiting code size suffice for your concerns.
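A rough sketch of the "post code blocks" distinction described above; the enum and its fields are hypothetical, not the actual candidate format:

```rust
/// Hypothetical classification of parachain blocks for the "post code" idea.
enum ParaBlockKind {
    /// Ordinary block: needs backing, availability, and approval checks.
    Regular,
    /// Provides new validation code and disallows all other functionality,
    /// so it only requires availability; approval becomes automatic because
    /// there is nothing else to execute.
    PostCode { code_hash: [u8; 32] },
    /// Switches to previously posted (and now available) code.
    SwitchCode { code_hash: [u8; 32] },
}
```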
We should update this to discuss non-timeout faults, like the wasm runner receiving SIGKILL due to violating an rlimit, etc. We'll halt parachains whenever a super majority judges the code flawed, meaning any invalidity condition besides the PVF detecting that the candidate is invalid, so including both timeouts and other non-timeout system faults like violating an rlimit. If we slash (either 100% for false validity claims, or the smaller slash for false invalidity claims) and the claim says timeout or non-timeout system fault, then the slashed validator should demonstrate to governance the configuration that causes the false validity claim, the timeout, or the fault. If correct, then governance refunds the slash and halts the parachain. At that point, the parachain must convince governance they've fixed the code before continuing. We'll let governance and the community figure out if and when parachains should ever compensate temporarily slashed nodes for lost rewards.
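To illustrate the distinction being asked for, a sketch of invalidity causes and the "halt the parachain" rule from the comment above; the types are illustrative, not existing Polkadot definitions:

```rust
/// Why a checker judged the candidate invalid. Illustrative only.
enum InvalidityCause {
    /// The PVF itself determined the candidate is invalid.
    PvfRejected,
    /// Execution exceeded the checker's timeout.
    Timeout,
    /// The wasm runner died from a system fault, e.g. SIGKILL after
    /// violating an rlimit.
    SystemFault { signal: i32 },
}

/// Per the comment above: any invalidity condition besides the PVF rejecting
/// the candidate halts the parachain once a super majority agrees; the
/// associated slash may later be refunded by governance if the validator can
/// demonstrate the configuration that produced the timeout or fault.
fn halts_parachain(cause: &InvalidityCause) -> bool {
    !matches!(cause, InvalidityCause::PvfRejected)
}
```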
I have been pondering about this and this is what I came up with. So, why do we slash? We slash so that attempts to get an invalid block in (to do some variant of a double spend, for example) lead to gambler's ruin.

The Underlying Problem

So why do we care about timeouts: could we not just abandon a candidate if enough approval checkers say "I ran into a timeout validating this"? Answer: no, because this could be abused by an attacker to get free tries at getting a bad candidate in. Just craft a candidate that takes ages to execute. Approval checkers are either your own guys, who don't have to execute the candidate at all, or good guys, who will time out, and you have just successfully avoided slashing on a failed attempt.

Non Issues

Therefore timeouts must mean an invalid vote, and in the above scenario that will be fine: a majority votes invalid and the attacker gets slashed. But what if a candidate was not 100% malicious as in the previous paragraph, but just really heavy, so that some validators run into the timeout? They would report invalid. If they are in the minority they will get slashed mildly ... which should actually be fine: if the slash is mild enough, you just got punished for being slow ... better get faster hardware, better monitor your hardware and have alerts on execution times. We measure CPU time, not wall clock time, so heavy load is not an excuse. We can have docs for operators informing them that they should monitor their execution times to avoid slashes for slow hardware. We do have conflicting requirements of not making those false-invalid-vote slashes too mild, because of spamming considerations, but that could be resolved relatively easily. (E.g. don't only disable validators if they lost enough stake, but also if they have been on the losing side of a dispute more than x times in the last session; see the sketch after this comment.) Note: if that was done on purpose, e.g. an attacker ignoring the backing timeout but providing a candidate that just passes the approval checking timeout on his machine, he might have managed to produce mild slashes on only slightly slower validators, at the high risk of getting slashed 100% himself. If he wants to minimize that risk, then again only really slow validators will get slashed. If he is fine with getting slashed 100%, then this brings us right to the next section.

The Real Issue

The more problematic issue is the other way round: some validators being super fast! In that case those super fast validators will be in the minority approving/backing a candidate deemed invalid. We just slashed validators 100% for doing an excellent job! How big is the issue though? In backing we have way stricter timeouts than in approvals, so an honest backer running into that issue must be really extraordinarily fast. A malicious backer obviously could craft such a candidate and could cause fast honest validators to get slashed, at the cost of losing all of their own stake as well. Not a very attractive attack: you end up in a lose-lose situation, with nobody gaining anything. Still, the backer could have been hacked, for example. This can be mitigated by requiring a backing vote threshold greater than 1, which should already provide sufficient protection: if we assume successful decentralization, it should be very unlikely for a hacker to have hijacked more than one validator.
Summary & Considerations

With mild slashes for voting invalid on valid candidates, a sufficiently large number of required backing votes, and good values for backing and approval checking timeouts, we should get very far without adding even more complexity to an already very complex system. While 2 bad validators are obviously way less than our usual Byzantine assumption, the integrity of the system was never at stake, so I would argue that justifies the weaker requirement. To avoid slashing fast honest validators 100%, we should make sure to make the ratio …
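A minimal sketch of the "also disable after repeated lost disputes" idea mentioned in the comment above; the struct, the threshold x, and the percentages are assumptions for illustration:

```rust
/// Per-validator dispute bookkeeping. Hypothetical type.
struct DisputeRecord {
    lost_disputes_this_session: u32,
    /// Cumulative slashed stake, in parts-per-million.
    slashed_fraction_ppm: u32,
}

const MAX_LOST_DISPUTES_PER_SESSION: u32 = 3; // the "x" above, illustrative
const DISABLE_SLASH_THRESHOLD_PPM: u32 = 100_000; // 10%, illustrative

/// Disable a validator either on accumulated slash or on repeatedly losing
/// disputes within a session, so mild slashes alone cannot be used to spam.
fn should_disable(record: &DisputeRecord) -> bool {
    record.slashed_fraction_ppm >= DISABLE_SLASH_THRESHOLD_PPM
        || record.lost_disputes_this_session > MAX_LOST_DISPUTES_PER_SESSION
}
```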
If I understand correctly, you're making two statements:
Interesting points.. I kinda believe them against random events, which is mostly what this PR handles too, but they're iffy against adversarial parachains.. I'm dubious about 2 in particular, because an adversarial parachain could likely identify the fastest validators from other behavior, and then simply wait until they saw all upcoming backers being fast, so zero slashing for the attacker, and pocketing the reporting rewards. We cannot depend solely upon timeouts here anyways, so imho we should do some work towards normalized time reporting by validators, which firstly means CPU timers, not the vague "milliseconds" written here. Indeed even the ratios may not make sense.. Are the historic execution time ratios described here useful? I suppose ratios being cross-parachain is a risky oversimplification. Yet, we're definitely not going to be doing parachain runtime testing across all validators, so are they still useful if they're cross-parachain? It's similarly true that if this proposal is too complex then we can tighten it around where the real damage gets done, or even tolerate more risk here by being stricter elsewhere, primarily governance refunding slashes once we witness some machines running fast enough. At the extreme, if this blows up several times then parachain development could become a much more "interactive" process, with auditors signing off on runtime upgrades or whatever. At the sci-fi extreme, we could even make validators produce a zero-knowledge proof of correct execution before being slashed, or even complicate our interactive proof (neither sounds realistic). As an aside, we've another adversarial behavior in which abusive backers permit extra-large parachain blocks into the relay chain, which perhaps warrants forgoing their backer rewards. A priori, 2 or 3 backers makes this rare enough that maybe it could be ignored, except it risks becoming a regular abuse.
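A minimal sketch of "CPU timers, not the vague milliseconds", using POSIX clock_gettime through the libc crate (a Linux-only assumption; nothing here specifies the interface the guide will actually use):

```rust
/// CPU time consumed by the current thread, in milliseconds. Unlike
/// wall-clock time, this is not inflated by load from other processes.
fn thread_cpu_time_ms() -> u64 {
    let mut ts = libc::timespec { tv_sec: 0, tv_nsec: 0 };
    // CLOCK_THREAD_CPUTIME_ID counts only this thread's CPU time.
    unsafe {
        libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts);
    }
    ts.tv_sec as u64 * 1_000 + ts.tv_nsec as u64 / 1_000_000
}
```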
Anyways, I'm dubious about the ratios here for sure, but at least superficially the alpha and beta part still looks sensible because it gives assurances. Do we want backer / approval times to be within a factor of 6 for some reason?
Not that I know of. I guess we are a bit time sensitive in approval checking, with no-shows and such. For the large parachain block issue: I don't quite follow. Large parachain blocks only play a role in backing and availability recovery/approvals, and approval checkers will reject the candidate if it exceeds the size limit. For the malicious parachain case: if we are free to make the approval validation and backing validation timeouts differ enough that it is practically impossible for a fast validator to validate within the backing timeout while the majority can't even within the longer approval timeout, then this should also not be an issue. The question is what our practical limits for allowed timeout ratios are. What limiting factors do we have? A very short backing validation timeout has obvious issues. A very long approval validation timeout would lead to no-shows and thus needless work, harming performance and delaying finality. 🤔
In this PR, alpha=2 and beta=3 yield approval check timeouts being 6 times as long as backing timeouts, which sounds fine I guess? I think no-show delays are like 12 or 24, not sure anymore. We're anyways unsure if nodes rebroadcasting their expected timeout estimates helps. It does not enter into the picture here, unless I forgot something.
Stale.
We only use wasmtime here, not wasmi. We never escalate and slash unless some approval checker times out (or reports invalid). Also, you never get slashed unless you're alpha slower or faster than 2/3rds of validators (or there are non-timeout invalidity reports).
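One possible reading of the "alpha slower or faster than 2/3rds of validators" rule, sketched below; the exact statistic used during escalation is an assumption here, not a specified algorithm:

```rust
/// Returns true if `my_estimate_ms` is more than `alpha` times away from the
/// estimates reported by at least two thirds of validators.
fn is_outlier(my_estimate_ms: u64, mut all_estimates_ms: Vec<u64>, alpha: u64) -> bool {
    all_estimates_ms.sort_unstable();
    let n = all_estimates_ms.len();
    if n < 3 {
        return false; // degenerate committee sizes are out of scope here
    }
    let two_thirds = 2 * n / 3;
    // "alpha slower than 2/3 of validators": my estimate exceeds alpha times
    // the estimates of at least two thirds of them.
    let too_slow = my_estimate_ms > alpha * all_estimates_ms[two_thirds - 1];
    // "alpha faster than 2/3 of validators": at least two thirds of them
    // reported more than alpha times my estimate.
    let too_fast = alpha * my_estimate_ms < all_estimates_ms[n - two_thirds];
    too_slow || too_fast
}
```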
We did not address non-timeout invalidity reports well enough here, did we @AlfonsoCev?
@AlistairStewart should review
We meant to open this from @AlfonsoCev's machine earlier, but Windows baffles me, so maybe we botched it. Replaces #1628