Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync hang for 22 minutes on mainnet after checkpoint error #1181

Closed
teor2345 opened this issue Oct 19, 2020 · 7 comments
Closed

Sync hang for 22 minutes on mainnet after checkpoint error #1181

teor2345 opened this issue Oct 19, 2020 · 7 comments
Assignees
Labels
C-bug Category: This is a bug

Comments

@teor2345
Copy link
Contributor

teor2345 commented Oct 19, 2020

Version

zebrad main branch, as at 19 October 2019.

Platform

Linux ... 5.4.68 #1-NixOS SMP Sat Sep 26 16:03:16 UTC 2020 x86_64 GNU/Linux

Description

zebrad start on mainnet did not output any logs at info level for 22 minutes. It restarted sync after a checkpoint error, started obtain_tips, then hung for 22 minutes. Then it continued to log similar checkpoint errors, without making progress.

After restarting Zebra, the sync continued to make progress.

I tried this:

Running zebrad start on mainnet, with some global and some local peers.
My network is slower and higher latency than the average Zebra developer.

I expected to see this happen:

zebrad start syncs up to the tip of the best chain.

Checkpoint errors contain a detailed explanation of the error, rather than a generic error message.

Instead, this happened:

Error Before Hang

Error: 
   0: failed to verify block
   1: block could not be checkpointed

Metadata

key value
version 3.0.0-alpha.0

SpanTrace and Logs

This error was preceded by at least one similar checkpoint error.
Oct 19 14:11:19.525  INFO sync: zebrad::components::sync: starting sync, obtaining new tips
Oct 19 14:11:19.525  INFO sync:obtain_tips: zebra_state::util: created block locator tip_height=Height(106000) min_locator_height=105901 locators=[Height(106000), Height(105999), Height(105998), Height(105996), Height(105992), Height(105984), Height(105968), Height(105936), Height(105901)]
Oct 19 14:11:31.034  WARN sync: zebrad::components::sync: sync error in ready task, waiting to restart sync e=
   0: zebrad::components::sync::block_fetch_verify
           with hash=0000000020a7360c83af082530bd98a573f8ba543ab742605b846955b0bc43f4
             at zebrad/src/components/sync.rs:615
   1: zebrad::components::sync::obtain_tips
             at zebrad/src/components/sync.rs:314
   2: zebrad::components::sync::sync
             at zebrad/src/components/sync.rs:169
Oct 19 14:13:11.037  INFO sync: zebrad::components::sync: starting sync, obtaining new tips
Oct 19 14:13:11.037  INFO sync:obtain_tips: zebra_state::util: created block locator tip_height=Height(106000) min_locator_height=105901 locators=[Height(106000), Height(105999), Height(105998), Height(105996), Height(105992), Height(105984), Height(105968), Height(105936), Height(105901)]

There was no info-level output for 22 minutes after the hang, then it logged a similar error, but only delayed for about 2 minutes.

@teor2345 teor2345 added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage labels Oct 19, 2020
@teor2345 teor2345 added this to the First Alpha Release milestone Oct 19, 2020
@teor2345 teor2345 mentioned this issue Oct 19, 2020
44 tasks
@teor2345
Copy link
Contributor Author

@hdevalence I think you were the last person to modify the sync code, do you have any insights here?

@yaahc how can we show the detailed checkpoint error message in the logs, rather than just "block could not be checkpointed" ?

@teor2345
Copy link
Contributor Author

teor2345 commented Oct 19, 2020

how can we show the detailed checkpoint error message in the logs, rather than just "block could not be checkpointed" ?

I tried adding #[source] to the inner errors, based on:
https://docs.rs/thiserror/1.0.21/thiserror/#details

If that's a good way to fix this issue, let's open a follow-up ticket to fix all the error types in every zebra crate?

@teor2345
Copy link
Contributor Author

The Windows CI sync tests also hang intermittently, when initialising the syncer:
https://github.com/ZcashFoundation/zebra/pull/1179/checks?check_run_id=1280264596

The hangs might be due to this bug, or they might be caused by the network on the CI host.

@hdevalence hdevalence self-assigned this Oct 22, 2020
@hdevalence
Copy link
Contributor

I saw similar issues while getting #1041 working but plumbing error context through the checkpointer was more work than I wanted to commit to doing in that chunk of work.

@yaahc
Copy link
Contributor

yaahc commented Oct 22, 2020

If that's a good way to fix this issue, let's open a follow-up ticket to fix all the error types in every zebra crate?

I think we may want to make a style choice here and choose to avoid Tuple style enum variants in our error types. If we write the errors as struct-style enum variants where we always remember to name the source error field source then thiserror will always automatically use that as a source error kind. I feel like with tuple enums + an error derive it is way too easy to make this mistake and accidentally break the chain of source errors.

@yaahc yaahc self-assigned this Oct 23, 2020
@hdevalence
Copy link
Contributor

Unblocked by #1041 and #1198.

@teor2345
Copy link
Contributor Author

We think this issue is fixed by the RocksDB change in #1325.

@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Feb 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug
Projects
None yet
Development

No branches or pull requests

4 participants