Multiple tests fail in CI with "inbound service overloaded" warnings #6506
Comments
Medium priority because it's happened twice in ~10 hours.
This could be a temporary failure caused by a network scanner, or by load on the Zcash network. PR #6235 modifies the mempool, but that code is off by default. I can't see any other recent change that could be related.
Another mempool test failed in PR #6486, with what appears to be a timing issue in components::mempool::storage::tests::prop::eviction_list_time_mixed:
It's possible that a recent dependency update or compiler update changed Zebra's performance or execution order, and we need to fix some tests to make them more robust.
sync_large_checkpoints_mempool_mainnet
failures in CI
Failed in PR #6512 in the merge queue. @mpguerra this is another test we should fix; the overall error rate is stopping us from merging PRs.
Unfortunately, this fix wasn't enough. PR #6534 failed with this error:
It looks like the syncer just hangs after downloading the genesis block:
What should happen is that the syncer downloads the next checkpoint, then exits once block 1 is committed to the state.
I'm starting to wonder if this revert was incorrect:
I'm going to log the user-agents of the overloaded nodes in #6537, and see if that helps.
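For anyone triaging these logs by hand, here is a rough sketch of how the overloaded peers could be tallied by user agent. It assumes log lines carry a `remote_user_agent="..."` field, as in the "inbound service is overloaded" line quoted later in this thread. The function names `user_agent` and `count_overloaded_agents` are hypothetical helpers, not part of Zebra.

```rust
use std::collections::HashMap;

/// Extracts the value of the `remote_user_agent="..."` field from a log line.
/// Hypothetical helper: the field name is taken from the log line quoted in this issue.
pub fn user_agent(line: &str) -> Option<&str> {
    let key = "remote_user_agent=\"";
    let start = line.find(key)? + key.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

/// Counts "inbound service is overloaded" log lines per remote user agent.
/// Lines without that message, or without a user agent field, are skipped.
pub fn count_overloaded_agents<'a>(
    lines: impl IntoIterator<Item = &'a str>,
) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for line in lines {
        if !line.contains("inbound service is overloaded") {
            continue;
        }
        if let Some(agent) = user_agent(line) {
            *counts.entry(agent).or_insert(0) += 1;
        }
    }
    counts
}
```

If one user agent dominates the counts, that would point towards a scanner or a specific peer implementation rather than general network load.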
@mpguerra the next step in solving this issue is merging PR #6537. Either it will fix the issue, or it will give us diagnostics that help with the fix. @oxarbitrage has been auto-assigned as the reviewer.
Failed on the
@teor2345 can you please add a size for this issue?
This is still happening, but it's much less frequent. It seems like a state, verifier, or network hang:
https://github.com/ZcashFoundation/zebra/actions/runs/4824819949/jobs/8595587969?pr=6581#step:3:4060
I think the next step would be adding some progress logs, just before the first big checkpoint is verified, to find out which part of Zebra is hanging. It doesn't seem to be recovering at all, even after a syncer restart. So it is probably the inbound service or something it depends on (the address book, verifiers, or state).
I'm going to un-assign myself and mark this fix as optional, because it doesn't seem to be happening very often any more.
Looking at this failure output, it seems like there are a lot of duplicate block and transaction broadcasts. We could check why the inbound service is performing poorly, or we could combine duplicate inbound requests. But first I want to make sure the inbound service is actually getting time to run after every request is queued.
Details of inbound block and transaction broadcasts:
2023-05-15T15:49:35.023057Z INFO {net="Main"}:{peer=Out("v4redacted:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection remote_user_agent="/MagicBean:5.4.0/" negotiated_version=Version(170100) peer="v4redacted:8233" last_peer_state=Some("AwaitingRequest::In::Req::AdvertiseTransactionIds") remote_height=Height(2087054) cached_addrs=1 connection_state=AwaitingRequest
https://github.com/ZcashFoundation/zebra/actions/runs/4981296666/jobs/8916500733#step:3:3906
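To make the "combine duplicate inbound requests" idea concrete, here is a minimal sketch of a queue that drops an advertised hash if the same hash is already waiting to be processed. This is only an illustration under my own assumptions: `DedupQueue` and `InvHash` are hypothetical names, and this is not how Zebra's inbound service is actually structured.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical stand-in for a block or transaction hash.
pub type InvHash = [u8; 32];

/// A FIFO queue that ignores hashes which are already pending.
#[derive(Default)]
pub struct DedupQueue {
    pending: VecDeque<InvHash>,
    seen: HashSet<InvHash>,
}

impl DedupQueue {
    pub fn new() -> Self {
        Self::default()
    }

    /// Queues `hash` and returns `true`, or returns `false` if it is already pending.
    pub fn push(&mut self, hash: InvHash) -> bool {
        if self.seen.insert(hash) {
            self.pending.push_back(hash);
            true
        } else {
            false
        }
    }

    /// Removes the oldest pending hash. The same hash can be queued again afterwards.
    pub fn pop(&mut self) -> Option<InvHash> {
        let hash = self.pending.pop_front()?;
        self.seen.remove(&hash);
        Some(hash)
    }

    /// Number of distinct hashes waiting to be processed.
    pub fn len(&self) -> usize {
        self.pending.len()
    }
}
```

With something like this, many peers advertising the same transaction would cost one queue slot rather than one slot per peer. It doesn't help if the inbound service isn't getting time to run at all, which is why that should be checked first.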
From PR #6665:
This seems to be a peer failure, but maybe "all ready peers are missing inventory" shouldn't hang for 4 minutes before trying again. It's not the same as this bug, but it could be related to it.
This seems to have stopped happening after our recent fixes.
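One possible way to avoid a fixed 4 minute wait is a retry delay that starts small and doubles up to a cap. This is a sketch of that general technique only, not a description of Zebra's syncer: `retry_delay` and its parameters are hypothetical, and suitable values would need testing.

```rust
use std::time::Duration;

/// Hypothetical retry delay: `base` doubled once per failed attempt, capped at `max`.
/// Saturating arithmetic keeps large attempt counts from overflowing.
pub fn retry_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}
```

For example, with a 1 second base and a 240 second cap, the first retries would happen after 1, 2, 4, and 8 seconds, and later retries would never wait longer than the cap.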
Motivation
This test is failing in CI on main: https://github.com/ZcashFoundation/zebra/actions/runs/4687263326/jobs/8306335685#step:15:598
It also failed in 'CI Docker / Test all' in a dependabot PR: https://github.com/ZcashFoundation/zebra/actions/runs/4691947659/jobs/8317961127?pr=6505#step:3:3777