pvf validation timeout #4969

Closed
xlc opened this issue Jul 8, 2024 · 8 comments

Comments

@xlc (Contributor) commented Jul 8, 2024

Getting this error on our internal testnet:

2024-07-08 05:38:35.865 WARN tokio-runtime-worker parachain::pvf: execution worker concluded, error occurred: candidate validation: invalid: hard timeout artifact_id=ArtifactId { code_hash: 0xb14e5edf8ef54349c887a9ee7cdaddb39156aac9ca53d3a1c434697628c4feff, executor_params_hash: 0x03170a2e7597b7b7e3d84c05391d139a62b157e78786d8c082f29dcf4c111314 } worker=Worker(1v267) worker_rip=true
2024-07-08 05:38:35.865 INFO tokio-runtime-worker parachain::candidate-validation: Failed to validate candidate para_id=Id(2000) error=Invalid(HardTimeout)
2024-07-08 05:38:35.866 WARN tokio-runtime-worker parachain::candidate-backing: Validation yielded an invalid candidate candidate_hash=0xb149d18e0796830b5c03f699b6002d4a42cf7b81c24b29a1b5b46c22f0617f56 reason=Timeout traceID=235656643200957744926756583968484633930

The para block is empty and was produced in 3 ms:

2024-07-08 09:32:00.019 INFO tokio-runtime-worker sc_basic_authorship::basic_authorship: 🎁 Prepared block for proposing at 2253304 (3 ms) [hash: 0x3713a36193696554d697035c9788c1e3912e2d1c5fd736798d04e919c2c0b992; parent_hash: 0x1732…c11d; extrinsics (8): [0x7866…5f7d, 0x66c3…5293, 0x9a3d…0e44, 0xadb6…0f1b, 0x495c…19cf, 0x1061…b5dd, 0x121e…e936, 0x9209…8606] 

The relay chain is running a 1.10.0 node with the rococo 9420 runtime.
The parachain is Acala 2.25.0, which uses polkadot-sdk 1.9.0.
It has been running fine for a few months and suddenly started getting this error; I don't see any special transaction/action that may have triggered it.

I will give #4640 a try to see if I can dig out more info

@alexggh (Contributor) commented Jul 8, 2024

The error says the candidate took too long to validate; most of the time I have seen it, it was because of a mismatch in computing power between the collator and the validator.

Are you sure the block took just 3ms to build? Because the timestamps you provided don't seem to match.
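For context, the PVF host races candidate execution against a deadline, and an execution that does not finish within it is reported as a hard timeout. A minimal sketch of that general pattern with tokio (not the actual PVF host code; the function name and the 2-second budget are made up for illustration):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for the real PVF execution; the actual host code in
// polkadot-sdk runs this in a separate worker process.
async fn execute_candidate() -> Result<(), String> {
    // ... run the PVF against the candidate ...
    Ok(())
}

#[tokio::main]
async fn main() {
    // Illustrative budget; in practice the execution timeout comes from the
    // executor parameters, not a hard-coded constant.
    let budget = Duration::from_secs(2);

    match timeout(budget, execute_candidate()).await {
        Ok(Ok(())) => println!("candidate valid"),
        Ok(Err(e)) => println!("candidate invalid: {e}"),
        // The future did not complete within the budget, which corresponds to
        // the hard timeout error in the log above.
        Err(_) => println!("invalid: hard timeout"),
    }
}
```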

@xlc (Contributor, Author) commented Jul 8, 2024

The validator log and collator log were gathered at different times and reference different blocks, but the parachain is stuck, so it just keeps reproducing the same block, and I am pretty sure it is a small one: I can still see the pending block and it is very empty, no runtime migration, no scheduled tasks, etc.

The validator and collator are running on machines with the same spec, and it had been working fine for many months until a few days ago.

@bkchr (Member) commented Jul 8, 2024

> Are you sure the block took just 3ms to build? Because the timestamps you provided don't seem to match.

Yeah, these 3 ms in the log reference the build time of the entire block.

@xlc and I already discussed this on Element. I don't really get why this block runs into performance issues.

@alexggh (Contributor) commented Jul 8, 2024

Yes, it is weird. Can you provide the full logs for both collators and validators over a period of time? Maybe something comes up.

@xlc (Contributor, Author) commented Jul 8, 2024

I think it is related to an IO issue. I'm not sure whether the node somehow suddenly started making an unreasonable amount of IO requests or something in AWS triggered it.
[Screenshot 2024-07-09 at 11:55:40 AM]

@xlc (Contributor, Author) commented Jul 9, 2024

Yeah, I can confirm it is indeed an IO issue. The validator benchmark that runs at startup actually highlights that the IO performance is significantly lower than expected.
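To cross-check a finding like this independently of the node, a crude sequential-write test against the database volume is usually enough to spot a heavily throttled disk. A minimal standard-library sketch (the path and sizes below are arbitrary and chosen only for illustration):

```rust
use std::fs::{remove_file, File};
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Point this at the volume the node's database lives on.
    let path = "/tmp/io-bench.tmp";
    let chunk = vec![0u8; 1024 * 1024]; // 1 MiB
    let chunks = 256;                   // 256 MiB total

    let start = Instant::now();
    let mut file = File::create(path)?;
    for _ in 0..chunks {
        file.write_all(&chunk)?;
    }
    // Force the data to disk so the timing reflects real IO, not page cache.
    file.sync_all()?;
    let elapsed = start.elapsed().as_secs_f64();
    remove_file(path)?;

    println!("sequential write: {:.1} MiB/s", chunks as f64 / elapsed);
    Ok(())
}
```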

@xlc closed this as completed on Jul 9, 2024
@bkchr (Member) commented Jul 9, 2024

@xlc how did the IO degrade that much? :D Any ideas?

@xlc (Contributor, Author) commented Jul 9, 2024

Not sure what triggered it, but this is using AWS EFS in burst mode, i.e. it can only sustain peak IO for a short period of time before getting throttled, so maybe it just crossed the default throughput line and is now getting heavily throttled.
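For reference, EFS bursting mode ties baseline throughput to the amount of data stored (roughly 50 KiB/s per GiB according to the AWS documentation) and maintains a burst-credit balance that drains whenever IO runs above that baseline; once the credits are exhausted, throughput is clamped back down to the baseline. A back-of-the-envelope sketch of that arithmetic, with made-up numbers:

```rust
fn main() {
    // All figures below are illustrative assumptions, not measurements.
    let stored_gib = 100.0_f64;         // hypothetical amount of data on the filesystem
    let credit_balance_gib = 500.0_f64; // hypothetical remaining burst credits
    let actual_mib_s = 60.0_f64;        // hypothetical sustained node IO

    // Bursting mode baseline: ~50 KiB/s per GiB stored.
    let baseline_mib_s = stored_gib * 50.0 / 1024.0;
    let drain_mib_s = (actual_mib_s - baseline_mib_s).max(0.0);
    println!("baseline {:.1} MiB/s, credits draining at {:.1} MiB/s", baseline_mib_s, drain_mib_s);

    if drain_mib_s > 0.0 {
        // Hours until the credit balance runs out and IO is throttled to baseline.
        let hours = credit_balance_gib * 1024.0 / drain_mib_s / 3600.0;
        println!("credits exhausted in ~{:.1} h", hours);
    }
}
```

With a small filesystem the baseline can be only a few MiB/s, which would explain a node that runs fine for months and then abruptly hits hard throttling once the credits are gone.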
