Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF: Instantiate wasm in pre-checking #7246

Merged
merged 11 commits into from
Jun 2, 2023

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented May 17, 2023

PULL REQUEST

Overview

Do an additional instantiation check to catch more issues.

Relevant issues

Closes #7218


Pre-checking mostly consists of attempting to prepare (compile) the PVF WASM blob. We use more strict limits (e.g. timeouts) here compared to regular preparation for execution. This way errors during preparation later are likely unrelated to the PVF itself, as it already passed pre-checking. We can treat such errors as local node issues.

We also have an additional step where we attempt to instantiate the WASM runtime without running it. This is unrelated to preparation so we don't time it, but it does help us catch more issues.
Copy link
Contributor Author

@mrcnski mrcnski May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we now treat runtime construction errors in execution as internal errors (abstain from voting)? I didn't do that because the PVF can still be un-instantiable on certain machines/configurations while making it through the pre-check vote, making attacks against those environments possible.

Or can we abstain from voting and still be sure that finality won't stall due to this PVF, since it passed the pre-checking quorum?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, I think we have to have nodes abstain from voting in that case, anything else would be unfair: They voted against the PVF in pre-checking, but later we slash them because they fail validating a candidate? Does not sound fair.

On the other hand if it passed pre-checking, we know there is a super majority being able to validate the candidate, so it should be "fine" if some nodes don't vote.

I do think we have to time the instantiation - it could be possible to make this take a very long time?

At the same time, I would assume in the honest case for it to be very quick, so it should be ok to make this part of the preparation timeout in pre-checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we have to time the instantiation - it could be possible to make this take a very long time?

At the same time, I would assume in the honest case for it to be very quick, so it should be ok to make this part of > the preparation timeout in pre-checking.

We do have the long host-side timeout, but yeah that's not ideal. Lumping it in with the preparation timeout makes it harder to reason about as we are no longer comparing the same operations in prechecking vs. preparation.

Best would be to have a separate timeout for instantiation, as well as for instantiation in execution where it would be more lenient. This way we actually make sure that if it passes in prechecking it should pass in execution. And having it in a separate thread here as well as in execution means we can do more fine-grained sandboxing later. Or am I overcomplicating it? I can also just do your suggestion for now and file an issue for later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that our pre-checking timeout is already way lower than the preparation timeout, it seems fine to include the instantiation - assuming we make sure the timeout is such that for honest PVFs it should not trigger. So indeed, I think we can keep it simple. It does not have to be comparable - it has to be "lower enough". Given that execution timeout is in the seconds, which includes instantiation I can hardly imagine how this simplification could be causing issues. It should be clearly documented though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we need the compiled artifact on-disk to run instantiation, and by that point the timeout has been stopped. We need to implement paritytech/substrate#13861. Then we could check that the artifact is instantiable before we actually write it to disk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd still publish VRFs but then no-show I think. In principle a vote against the PVF could mean "no-show early" but afaik there is no reason forthat extra code complexity, probably you just announce your VRF and then abandon it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we need the compiled artifact on-disk to run instantiation, and by that point the timeout has been stopped. We need to implement paritytech/substrate#13861. Then we could check that the artifact is instantiable before we actually write it to disk.

Let's just make sure there is some timeout. The rest is: "What is easiest to implement" - it should not matter too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eskimor Done. Once we have sandboxing I think this will need its own thread. For now it should be good.

node/core/pvf/src/pvf.rs Outdated Show resolved Hide resolved
@mrcnski mrcnski requested a review from eskimor May 17, 2023 21:52
///
/// # Safety
///
/// See `execute` for safety considerations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

[`execute`]

to create a doc-link

node/core/pvf/worker/src/prepare.rs Outdated Show resolved Hide resolved
@mrcnski mrcnski requested review from slumber and burdges May 22, 2023 20:10
@mrcnski mrcnski self-assigned this May 22, 2023
@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels May 22, 2023
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to pvf/common since Executor was now used in prechecking for runtime construction!

Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @mrcnski !

node/core/pvf/common/src/executor_intf.rs Outdated Show resolved Hide resolved
sp_io::trie::HostFunctions,
);

/// The validation externalities that will panic on any storage related access.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear here, why we would want that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already there, I just moved some stuff around in this PR. But, it is a good question. I always assumed it was because validation shouldn't ever need to call these functions, though maybe there is a stronger reason. We should confirm the reasoning and document it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to find #3005 with some justification, which I added as docs here. :)

(Aside: I also randomly found #2710 which is amazing, especially the linked hackmd.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is off-topic but I feel like we should make the historical PRs more discoverable. Especially now that I borked the blame history with all my refactors. 🙈 Maybe a HISTORY.md would be nice, for context and for linking out to important PRs. And why is this project closed I wonder? 🤔

node/core/pvf/common/src/executor_intf.rs Show resolved Hide resolved
node/core/pvf/execute-worker/src/lib.rs Show resolved Hide resolved
node/core/pvf/prepare-worker/src/lib.rs Show resolved Hide resolved
@mrcnski mrcnski requested a review from eskimor May 30, 2023 19:00
@mrcnski
Copy link
Contributor Author

mrcnski commented Jun 2, 2023

bot merge

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instantiate wasm in pre-checking
4 participants