Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF: Don't dispute on missing artifact #7011

Merged
merged 7 commits into from
Apr 20, 2023
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 33 additions & 17 deletions node/core/candidate-validation/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -691,38 +691,54 @@ trait ValidationBackend {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating candidate-validation tests would not harm

/// Tries executing a PVF. Will retry once if an error is encountered that may have been
/// transient.
///
/// NOTE: Should retry only on errors that are a result of execution itself, and not of
/// preparation.
async fn validate_candidate_with_retry(
&mut self,
raw_validation_code: Vec<u8>,
exec_timeout: Duration,
params: ValidationParams,
executor_params: ExecutorParams,
) -> Result<WasmValidationResult, ValidationError> {
// Construct the PVF a single time, since it is an expensive operation. Cloning it is cheap.
let prep_timeout = pvf_prep_timeout(&executor_params, PvfPrepTimeoutKind::Lenient);
// Construct the PVF a single time, since it is an expensive operation. Cloning it is cheap.
let pvf = PvfPrepData::from_code(raw_validation_code, executor_params, prep_timeout);

let mut validation_result =
self.validate_candidate(pvf.clone(), exec_timeout, params.encode()).await;

// If we get an AmbiguousWorkerDeath error, retry once after a brief delay, on the
// assumption that the conditions that caused this error may have been transient. Note that
// this error is only a result of execution itself and not of preparation.
if let Err(ValidationError::InvalidCandidate(WasmInvalidCandidate::AmbiguousWorkerDeath)) =
validation_result
{
// Wait a brief delay before retrying.
futures_timer::Delay::new(PVF_EXECUTION_RETRY_DELAY).await;
// Allow one retry for each kind of error.
let mut num_internal_retries_left = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make this higher, since this kind of error is probably the most likely to be transient.

let mut num_awd_retries_left = 1;
loop {
match validation_result {
Err(ValidationError::InvalidCandidate(
WasmInvalidCandidate::AmbiguousWorkerDeath,
)) if num_awd_retries_left > 0 => num_awd_retries_left -= 1,
Err(ValidationError::InternalError(_)) if num_internal_retries_left > 0 =>
num_internal_retries_left -= 1,
_ => break,
}

// If we got a possibly transient error, retry once after a brief delay, on the assumption
// that the conditions that caused this error may have resolved on their own.
{
// Wait a brief delay before retrying.
futures_timer::Delay::new(PVF_EXECUTION_RETRY_DELAY).await;

gum::warn!(
target: LOG_TARGET,
?pvf,
"Re-trying failed candidate validation due to AmbiguousWorkerDeath."
);
gum::warn!(
target: LOG_TARGET,
?pvf,
"Re-trying failed candidate validation due to possible transient error: {:?}",
validation_result
);

// Encode the params again when re-trying. We expect the retry case to be relatively
// rare, and we want to avoid unconditionally cloning data.
validation_result = self.validate_candidate(pvf, exec_timeout, params.encode()).await;
// Encode the params again when re-trying. We expect the retry case to be relatively
// rare, and we want to avoid unconditionally cloning data.
validation_result =
self.validate_candidate(pvf.clone(), exec_timeout, params.encode()).await;
}
}

validation_result
Expand Down
2 changes: 1 addition & 1 deletion node/core/pvf/src/execute/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,5 +23,5 @@
mod queue;
mod worker;

pub use queue::{start, ToQueue};
pub use queue::{start, PendingExecutionRequest, ToQueue};
pub use worker::{worker_entrypoint, Response as ExecuteResponse};
22 changes: 14 additions & 8 deletions node/core/pvf/src/execute/queue.rs
Original file line number Diff line number Diff line change
Expand Up @@ -50,13 +50,17 @@ slotmap::new_key_type! { struct Worker; }

#[derive(Debug)]
pub enum ToQueue {
Enqueue {
artifact: ArtifactPathId,
exec_timeout: Duration,
params: Vec<u8>,
executor_params: ExecutorParams,
result_tx: ResultSender,
},
Enqueue { artifact: ArtifactPathId, pending_execution_request: PendingExecutionRequest },
}

/// An execution request that should execute the PVF (known in the context) and send the results
/// to the given result sender.
#[derive(Debug)]
pub struct PendingExecutionRequest {
pub exec_timeout: Duration,
pub params: Vec<u8>,
pub executor_params: ExecutorParams,
pub result_tx: ResultSender,
}

struct ExecuteJob {
Expand Down Expand Up @@ -259,7 +263,9 @@ async fn purge_dead(metrics: &Metrics, workers: &mut Workers) {
}

fn handle_to_queue(queue: &mut Queue, to_queue: ToQueue) {
let ToQueue::Enqueue { artifact, exec_timeout, params, executor_params, result_tx } = to_queue;
let ToQueue::Enqueue { artifact, pending_execution_request } = to_queue;
let PendingExecutionRequest { exec_timeout, params, executor_params, result_tx } =
pending_execution_request;
gum::debug!(
target: LOG_TARGET,
validation_code_hash = ?artifact.id.code_hash,
Expand Down
15 changes: 14 additions & 1 deletion node/core/pvf/src/execute/worker.rs
Original file line number Diff line number Diff line change
Expand Up @@ -261,6 +261,13 @@ impl Response {
Self::InvalidCandidate(format!("{}: {}", ctx, msg))
}
}
fn format_internal(ctx: &'static str, msg: &str) -> Self {
if msg.is_empty() {
Self::InternalError(ctx.to_string())
} else {
Self::InternalError(format!("{}: {}", ctx, msg))
}
}
}

/// The entrypoint that the spawned execute worker should start with. The `socket_path` specifies
Expand Down Expand Up @@ -359,7 +366,13 @@ fn validate_using_artifact(
// [`executor_intf::prepare`].
executor.execute(artifact_path.as_ref(), params)
} {
Err(err) => return Response::format_invalid("execute", &err),
Err(err) =>
return if err.contains("failed to open file: No such file or directory") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really requires a refactor changing the error type to something sensible, like an enum. Matching on a string is way too error prone. Something changes the message, localization, ....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very likely to come from substrate, it's full of string errors. I agree that matching against strings is no-go, but otherwise we'd have to halt the pr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really requires a refactor changing the error type to something sensible, like an enum

Vote up, I've already raised that concern somewhere... Many errors coming from Substrate are not sensible at all. Also agree that string matching makes no good.

@mrcnski a (probably stupid) idea: until we have an enum error from Substrate, would it be better not to rely on its string errors but to check for the file existence ourselves? Should be simple enough. Of course, it introduces a race condition, but still better than parsing strings. Also, nobody guarantees that the file persists between the moments when it is open and when it is read, so that kind of race condition already exists anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it comes from Substrate. Didn't think about localization. 😬 I considered just treating RuntimeConstruction itself as an internal error, but seems it's also used for some case where wasm runs out of memory, which would be a problem with the PVF itself. link

Checking for the file existence seems sensible to me...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problems with the PVF itself we agreed are also no reason to raise a dispute, since we have pre-checking enabled. Basically any error that is independent of the candidate at hand should not be cause for a dispute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's in any case create a ticket for fixing those string errors - or at least the one in question right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like "output exceeds bounds of wasm memory" should not be an issue with runtime construction, but is actually indicative of a malicious PVF. And I think this would get past pre-checking, since that just compiles and doesn't execute.

https://github.com/paritytech/substrate/blob/master/client/executor/wasmtime/src/runtime.rs#L772-L780

// Do a length check before allocating. The returned output should not be bigger than the
// available WASM memory. Otherwise, a malicious parachain can trigger a large allocation,
// potentially causing memory exhaustion.
//
// Get the size of the WASM memory in bytes.
let memory_size = ctx.as_context().data().memory().data_size(ctx);
if checked_range(output_ptr as usize, output_len as usize, memory_size).is_none() {
    Err(WasmError::Other("output exceeds bounds of wasm memory".into()))?
}

(I used WasmError::Other here to match the other errors in the file, without realizing it gets converted to RuntimeConstruction. 🤷‍♂️)

Anyway, basically, this one "output exceeds bounds of wasm memory" case is deterministic and we should definitely vote against. If we gave it a new separate enum in Substrate, then we could treat the existing RuntimeConstruction as an internal error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised a small fix for the output-bounds case here, but I'm still not confident that RuntimeConstruction is always a transient error and don't think we should use it. Unless we treat it as a new "possibly transient" case, meaning we retry and dispute if it happens again. 🤷‍♂️

For the file-not-found case, there is not a clear way to fix the error story on the Substrate side. Just having another check here for file existence should be enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the check, please reference the substrate issue though in a comment. This way, readers understand why we did it this way and we can reevaluate once the issue is fixed.

// Raise an internal error if the file is missing.
Response::format_internal("execute: missing file", &err)
} else {
Response::format_invalid("execute", &err)
},
Ok(d) => d,
};

Expand Down
Loading