Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not re-prepare PVFs if not needed #4211

Merged
merged 8 commits into from
Apr 25, 2024

Conversation

s0me0ne-unkn0wn
Copy link
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn commented Apr 19, 2024

Currently, PVFs are re-prepared if any execution environment parameter changes. As we've recently seen on Kusama and Polkadot, that may lead to a severe finality lag because every validator has to re-prepare every PVF. That cannot be avoided altogether; however, we could cease re-preparing PVFs when a change in the execution environment can't lead to a change in the artifact itself. For example, it's clear that changing the execution timeout cannot affect the artifact.

In this PR, I'm introducing a separate hash for the subset of execution environment parameters that changes only if a preparation-related parameter changes. It introduces some minor code duplication, but without that, the scope of changes would be much bigger.

TODO:

  • Add a test to ensure the artifact is not re-prepared if non-preparation-related parameter is changed
  • Add a test to ensure the artifact is re-prepared if a preparation-related parameter is changed
  • Add comments, warnings, and, possibly, a test to ensure a new parameter ever added to the executor environment parameters will be evaluated by the author of changes with respect to its artifact preparation impact and added to the new hash preimage if needed.

Closes #4132

@s0me0ne-unkn0wn s0me0ne-unkn0wn added the T8-polkadot This PR/Issue is related to/affects the Polkadot network. label Apr 20, 2024
@s0me0ne-unkn0wn s0me0ne-unkn0wn marked this pull request as ready for review April 20, 2024 14:41

let mut enc = b"prep".to_vec();
for param in &self.0 {
if matches!(param, StackLogicalMax(_) | PvfPrepTimeout(..) | WasmExtBulkMemory) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks brittle wrt adding new parameters. I would rewrite and use !matches such that any new parameter added is considered preparation related unless we explicitly name it here as non-preparation related.

Copy link
Contributor Author

@s0me0ne-unkn0wn s0me0ne-unkn0wn Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an obvious improvement I've overseen, thank you for the suggestion!

Another question I'm trying to answer: is PvfPrepTimeout a parameter that affects preparation indeed?

That may be a bit counterintuitive, but I'm more inclined to say "no" than "yes". We have a strict pre-checking timeout, and we never re-pre-check, even if executor environment params change. The preparation timeout is lenient, and anyway, those timeouts are non-deterministic. So I don't really see much sense in re-preparing artifacts if this timeout changes (especially given that we're not going to ever decrease them). WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just try to see what would happen if we say no. Let's first pick the likely more troublesome situation, we for some reason reduce the timeout:

Artifacts are not being recompiled, but a node that restarts will and may now hit a timeout it did not hit before. If I am not mistaken, the node will not dispute, but simply don't vote - right? Now, if that happens on just a few nodes, not too much harm done.

Now, if this is a general problem, everytime a node restarts, we have another machine not voting. So we might run into finality issues.

Problems:

  1. Potential finality stall.
  2. The issue will only show up gradually over time and might be hard to debug. Real issues might only appear weeks later.

Now, let's flip it and let's say "yes". Now all validators will recompile immediately, including backers. I am assuming that anything that got backed with the old artifact, will also be approved and disputed with the old artifact (this is true, right?). In that case the situation would be way better, because now we only have a single parachain that will completely cease to make progress, but all other paras and the relay chain won't be affected at all. Only the backers for that para will try to prepare and fail, as long as they are assigned.

If we assume we will only ever increase that timeout, then of course this should be fine, but is this a sound assumption? Why would we change the timeout at all?

  1. Either, we are being attacked and someone abuses those long timeouts - we would need to reduce.
  2. We have legitimate PVFs that run into the limit ... we would need to increase or fix otherwise.

Couldn't say, which scenario is more likely. That we are never ever re-prechecking is more a bug than anything. What we should actually be doing is re-prechecking and only enacting the new paras, once this was successful. That would also avoid the finality issues on changing the parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the node will not dispute, but simply don't vote - right?

Yes, I believe so. Although we treat the preparation timeout as a deterministic error (as it hasn't failed during the pre-check), we don't dispute.

I am assuming that anything that got backed with the old artifact, will also be approved and disputed with the old artifact (this is true, right?).

We should always use execution parameters from the session where the candidate was produced, so yes, they should always be executed with the same artifact.

we only have a single parachain that will completely cease to make progress

Why do you think it's only a single parachain? The timeout is per-network per-session, so after passing the session boundary where this parameter change is enacted, all the artifacts should be re-prepared, as we saw during the Kusama and Polkadot incidents.

If we assume we will only ever increase that timeout, then of course this should be fine, but is this a sound assumption?

I remember we talked about decreasing the timeouts as not being safe, as it may make PVFs that are already on-chain fail. At the very least, we should have some tooling that could be used to check all the already-existing PVFs not to fail with the new set of executor parameters on the reference hardware (I think @ordian was working on that one?)

Copy link
Member

@ordian ordian Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, we should have some tooling that could be used to check all the already-existing PVFs not to fail with the new set of executor parameters on the reference hardware (I think @ordian was working on that one?)

The tooling we do have, but currently it's not integrated into the release process. But even if it could be used to check the current PVFs at the time of the release, one could imagine an attacker uploading a new PVF (with on-demand core) right after the release that will be on the edge of the old time limit. So it doesn't solve the problem by itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it doesn't solve the problem by itself.

Rerunning pre-checking before enactment would.

Why do you think it's only a single parachain

Not necessarily a single one, but most certainly not all of them. Otherwise we would have been really stupid with the parameters.

I remember we talked about decreasing the timeouts as not being safe, as it may make PVFs that are already on-chain fail.

True, but I think this should be fixed. Without pre-checking we get a finality stall, that alone is reason enough to do it properly and not enact parameters that have not been checked. Like, assuming we want to change these parameters at all: Why would we only ever want to increase? Most likely we need to increase to mitigate some issue, but would actually want to decrease again later, once it is fixed. Machines also get faster, compilers get better, so the chances that we actually want to decrease at some point are likely higher that we would want to increase.

In other words, by saying "no" we dig ourselves deeper into a solution that is actually not sound.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should always use execution parameters from the session where the candidate was produced, so yes, they should always be executed with the same artifact.

But that's not true, is it ?

On approval voting we fetch executor params based on the block the candidate is included: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/core/approval-voting/src/lib.rs#L2967, .

On backing we fetch them based on the relay_parent:
https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/core/backing/src/lib.rs#L666

I don't think there is anything preventing the two relay chain blocks being in different sessions at the boundary, especially with async backing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that's not true, is it ?

Hmmm, that may be an important find... If that does not hold, it should be fixed. The very purport of the executor parameters is to always use the same set of parameters with the same candidate.

polkadot/primitives/src/v7/executor_params.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@AndreiEres AndreiEres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

polkadot/primitives/src/v7/executor_params.rs Outdated Show resolved Hide resolved
@s0me0ne-unkn0wn s0me0ne-unkn0wn added this pull request to the merge queue Apr 25, 2024
Merged via the queue into master with commit c26cf3f Apr 25, 2024
141 of 143 checks passed
@s0me0ne-unkn0wn s0me0ne-unkn0wn deleted the s0me0ne/no-pvf-reprepare-if-non-needed branch April 25, 2024 10:40
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
Currently, PVFs are re-prepared if any execution environment parameter
changes. As we've recently seen on Kusama and Polkadot, that may lead to
a severe finality lag because every validator has to re-prepare every
PVF. That cannot be avoided altogether; however, we could cease
re-preparing PVFs when a change in the execution environment can't lead
to a change in the artifact itself. For example, it's clear that
changing the execution timeout cannot affect the artifact.

In this PR, I'm introducing a separate hash for the subset of execution
environment parameters that changes only if a preparation-related
parameter changes. It introduces some minor code duplication, but
without that, the scope of changes would be much bigger.

TODO:
- [x] Add a test to ensure the artifact is not re-prepared if
non-preparation-related parameter is changed
- [x] Add a test to ensure the artifact is re-prepared if a
preparation-related parameter is changed
- [x] Add comments, warnings, and, possibly, a test to ensure a new
parameter ever added to the executor environment parameters will be
evaluated by the author of changes with respect to its artifact
preparation impact and added to the new hash preimage if needed.

Closes paritytech#4132
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
None yet
6 participants