WIP: use hedge middleware #1089
Conversation
We handle request cancellation in two places: before we transition into the AwaitingResponse state, and while we are in AwaitingResponse. We need both checks: otherwise, once we started processing a request, we wouldn't notice the cancellation until the timeout elapsed. The first is a check that the oneshot is not already canceled. For the second, we wait on a cancellation, either from a timeout or from the tx channel closing.
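As a rough illustration (not the actual connection code; the types and timeout value here are hypothetical, and the real state machine also races the peer's actual response), the two checks look something like this with tokio primitives:

```rust
use std::time::Duration;
use tokio::sync::oneshot;

const REQUEST_TIMEOUT: Duration = Duration::from_secs(20);

/// Hypothetical placeholder for a peer response.
struct Response;

/// Hypothetical in-flight request handle: the caller holds the matching
/// `Receiver`, and dropping it cancels the request.
struct InProgressRequest {
    tx: oneshot::Sender<Result<Response, &'static str>>,
}

async fn await_response(mut req: InProgressRequest) {
    // Check 1: before transitioning into AwaitingResponse, make sure the
    // oneshot hasn't already been canceled by the caller.
    if req.tx.is_closed() {
        // Canceled before we started processing; just drop the request.
        return;
    }

    // Check 2: while in AwaitingResponse, wait for a cancellation, either
    // from the timeout elapsing or from the tx channel closing. (The real
    // state machine also waits on the peer's response here.)
    let canceled = tokio::time::timeout(REQUEST_TIMEOUT, req.tx.closed()).await;

    match canceled {
        // The caller hung up; stop waiting and drop the request.
        Ok(()) => {}
        // The timeout fired first; report an error if anyone is listening.
        Err(_elapsed) => {
            let _ = req.tx.send(Err("request timed out"));
        }
    }
}
```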
These messages might be unsolicited, or they might be a response to a request we already canceled. So don't fail the whole connection; just drop the message and move on.
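A minimal sketch of that policy, with hypothetical message and error types standing in for the real ones:

```rust
/// Hypothetical inbound message type.
enum Message {
    Block(Vec<u8>),
    Ping(u64),
}

/// Hypothetical connection error type.
#[derive(Debug)]
struct ConnectionFailed;

/// Handle a message that doesn't match any pending request. It might be
/// unsolicited, or a late response to a request we already canceled, so we
/// drop it and keep the connection alive instead of returning an error.
fn handle_unexpected_message(msg: Message) -> Result<(), ConnectionFailed> {
    // Drop the message and move on; do not fail the whole connection.
    drop(msg);
    Ok(())
}
```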
This makes two changes relative to the existing download code: 1. It uses a oneshot to attempt to cancel the download task after it has started; 2. It encapsulates the download creation and cancellation logic into a Downloads struct.
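A heavily simplified sketch of that shape (hypothetical field and method names, using block heights instead of hashes and a sleep in place of the real download future):

```rust
use std::{collections::HashMap, time::Duration};
use tokio::{sync::oneshot, task::JoinHandle};

/// Hypothetical, heavily simplified version of a `Downloads` struct that
/// owns both the in-flight download tasks and their cancellation handles.
#[derive(Default)]
struct Downloads {
    cancel_handles: HashMap<u32, oneshot::Sender<()>>,
    tasks: Vec<JoinHandle<()>>,
}

impl Downloads {
    /// Start a download for `height`, remembering how to cancel it later.
    fn queue(&mut self, height: u32) {
        let (cancel_tx, cancel_rx) = oneshot::channel::<()>();

        let task = tokio::spawn(async move {
            tokio::select! {
                // The real download future would go here; a sleep stands in for it.
                _ = tokio::time::sleep(Duration::from_secs(5)) => {
                    // Download finished normally.
                }
                _ = cancel_rx => {
                    // Cancellation requested after the task started: give up early.
                }
            }
        });

        self.cancel_handles.insert(height, cancel_tx);
        self.tasks.push(task);
    }

    /// Cancel every in-flight download, for example when the sync restarts.
    fn cancel_all(&mut self) {
        for (_height, cancel_tx) in self.cancel_handles.drain() {
            // Sending on the oneshot tells the task to stop; ignore errors
            // from downloads that already finished.
            let _ = cancel_tx.send(());
        }
        self.tasks.clear();
    }
}
```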
Try to use the better cancellation logic to revert to the previous sync algorithm. As designed, the sync algorithm is supposed to proceed by downloading state prospectively and handling errors by flushing the pipeline and starting over. This hasn't worked well, because we didn't previously cancel tasks properly. Now that we can, try to use something in the spirit of the original sync algorithm.
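In that spirit, the intended control flow is roughly the following sketch, with hypothetical function names standing in for the real syncer steps:

```rust
/// Hypothetical error type for a single sync step.
#[derive(Debug)]
struct SyncError;

/// Hypothetical single step of the syncer: extend the tips, queue
/// prospective block downloads, and check on downloads already in flight.
async fn extend_tips_and_download() -> Result<(), SyncError> {
    // ...ObtainTips / ExtendTips requests and download queuing go here...
    Ok(())
}

/// Hypothetical hook that cancels every in-flight download.
fn cancel_all_downloads() {}

/// The intended shape of the sync algorithm: proceed optimistically, and on
/// any error flush the whole pipeline and start over from scratch.
async fn sync_loop() {
    loop {
        if extend_tips_and_download().await.is_err() {
            // Now that in-flight downloads can actually be canceled, flushing
            // the pipeline no longer leaves orphaned tasks tying up peers.
            cancel_all_downloads();
            // The next loop iteration restarts from ObtainTips.
        }
    }
}
```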
In normal operation, the charts show spikes when we start downloading new blocks, followed by dead periods where we're waiting for failed or high-latency requests to time out before we can keep going: the peer set is fully utilized in the peak periods and goes unused in the dead periods. With these changes, after the hedge middleware kicks in, downloads slow to a crawl, and the peer set is not being well-utilized.
Using hdevalence/tower@486360e fixes the performance regression and seems to do better: the charts show higher sustained download speeds and less time spent waiting for a single block to arrive, although the problem is not completely gone.
@hdevalence are we going to pursue hedge, or not?
Could this PR fix the sync hang in #1181?
This was blocked on tower-rs/tower#472, referenced above, but can go ahead now. I'd prefer to combine this PR with #1041, because that way there's just one set of sync changes to keep track of and review.
The hedge middleware implements hedged requests, as described in _The Tail At Scale_. The idea is that we auto-tune our retry logic according to the actual network conditions, pre-emptively retrying requests that exceed some latency percentile. This would hopefully solve the problem where our timeouts are too long on mainnet and too slow on testnet.
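For intuition, a single hedged request behaves roughly like this sketch (hypothetical fetch function, hard-coded peers, and a fixed p95 value; the middleware maintains the latency histogram and issues the second copy automatically):

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Hypothetical stand-in for a block request to one peer.
async fn fetch_block(peer: &str) -> Result<Vec<u8>, &'static str> {
    // ...send the request over the network and await the response...
    let _ = peer;
    Ok(Vec::new())
}

/// Hedged request in the spirit of "The Tail at Scale": issue the request,
/// and if it hasn't completed within the observed p95 latency, issue a
/// second copy to another peer and take whichever answer arrives first.
/// The middleware measures the percentile from actual request latencies
/// rather than hard-coding it, which is what lets the retry delay track
/// real network conditions instead of a fixed timeout.
async fn hedged_fetch(p95: Duration) -> Result<Vec<u8>, &'static str> {
    tokio::select! {
        first = fetch_block("peer-a") => first,
        second = async {
            sleep(p95).await;
            fetch_block("peer-b").await
        } => second,
    }
}
```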
Sounds good. Did you want a review, even though the tests are failing?
Oops, sorry, we're missing the final commit here in #1041.
I'd rather just close this and wait until the rebase of #1041 is finished.
/// their connection state.
///
/// (ObtainTips failures use the sync restart timeout.)
const TIPS_RETRY_TIMEOUT: Duration = Duration::from_secs(60);

/// Controls how long we wait to restart syncing after finishing a sync run.
///
/// This timeout should be long enough to:
This looks good - there are some comments that need updating.
I haven't tested it yet.
The hedge middleware implements hedged requests, as described in The Tail At Scale. The idea is that we auto-tune our retry logic according to the actual network conditions, pre-emptively retrying requests that exceed some latency percentile. This would hopefully solve the problem where our timeouts are too long on mainnet and too slow on testnet.
Currently this doesn't work well: for some reason, as soon as the hedge middleware kicks in, something blocks requests from actually reaching the peer set. This shows up as big delays pushing requests through the middleware stack, even though the peer set has plenty of ready peer connections.
I'm not totally sure why this happens. At the moment digging into it seems like a low priority, so this PR can sit open for a while.