
WIP: use hedge middleware #1089

Closed
wants to merge 6 commits

Conversation

hdevalence
Contributor

The hedge middleware implements hedged requests, as described in The Tail At Scale. The idea is that we auto-tune our retry logic to actual network conditions, pre-emptively retrying requests that exceed some latency percentile. This would hopefully solve the problem where our timeouts are too long for mainnet and too short for testnet.
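
As a rough illustration of the hedging idea only (this is not tower's hedge API and not Zebra code; the peer names, the stand-in fetch function, and the hardcoded p95 delay are all invented):

```rust
// A minimal sketch of hedged requests: if the first copy of a request is
// still pending after a delay derived from an observed latency percentile,
// send a second copy and take whichever response arrives first.
use std::time::Duration;
use tokio::time::sleep;

// Stand-in for a real peer request.
async fn fetch(peer: &str) -> Result<String, String> {
    sleep(Duration::from_millis(50)).await;
    Ok(format!("response from {peer}"))
}

async fn hedged_fetch(primary: &str, backup: &str, p95: Duration) -> Result<String, String> {
    tokio::select! {
        // The original request.
        first = fetch(primary) => first,
        // The hedge: only sent if the original is still pending after `p95`.
        second = async {
            sleep(p95).await;
            fetch(backup).await
        } => second,
    }
}

#[tokio::main]
async fn main() {
    // In the real middleware the percentile would be measured from recent
    // requests; here it is hardcoded for illustration.
    let resp = hedged_fetch("peer-a", "peer-b", Duration::from_millis(30)).await;
    println!("{resp:?}");
}
```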

Currently this doesn't work well: for some reason, as soon as the hedge middleware kicks in, something blocks requests from actually reaching the peer set. This shows up as long delays pushing requests through the middleware stack, even though the peer set has many ready peer connections.

I'm not totally sure why this happens. At the moment digging into it seems like a low priority, so this PR can sit open for a while.

We handle request cancellation in two places: before we transition into
the AwaitingResponse state, and while we are in AwaitingResponse.  We
need both, because if we only checked before starting, a request that
was already being processed wouldn't see its cancellation until the
timeout elapsed.

The first is a check that the oneshot is not already canceled.

For the second, we wait on a cancellation, either from a timeout or from
the tx channel closing.
Incoming messages might be unsolicited, or they might be a response to a
request we already canceled.  In either case, don't fail the whole
connection; just drop the message and move on.
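
A minimal sketch of those two cancellation points, assuming a tokio oneshot as the cancellation signal; `PeerRequest`, `send_to_peer`, and the 4-second timeout are placeholders, not Zebra's actual types or values:

```rust
// Sketch: check for cancellation before entering AwaitingResponse, and race
// the response against cancellation while in AwaitingResponse.
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

struct PeerRequest {
    body: String,
    // The requester cancels by dropping the receiving end; we hold the sender.
    tx: oneshot::Sender<Result<String, String>>,
}

// Stand-in for the real network round trip.
async fn send_to_peer(body: &str) -> Result<String, String> {
    Ok(format!("pong for {body}"))
}

async fn handle_request(mut req: PeerRequest) {
    // Cancellation point 1: before transitioning into AwaitingResponse,
    // check whether the requester has already gone away.
    if req.tx.is_closed() {
        return;
    }

    // Cancellation point 2: while in AwaitingResponse, race the response
    // against both a timeout and the requester hanging up.
    let body = req.body.clone();
    let response = tokio::select! {
        resp = timeout(Duration::from_secs(4), send_to_peer(&body)) => {
            resp.map_err(|_| "timeout".to_string()).and_then(|r| r)
        }
        _ = req.tx.closed() => return, // canceled mid-flight: just stop
    };

    // Ignore the send error: the requester may have hung up in the meantime.
    let _ = req.tx.send(response);
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    handle_request(PeerRequest { body: "ping".into(), tx }).await;
    println!("{:?}", rx.await);
}
```
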
This makes two changes relative to the existing download code:

1.  It uses a oneshot to attempt to cancel the download task after it
    has started;

2.  It encapsulates the download creation and cancellation logic into a
    Downloads struct (see the sketch after this list).
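
A rough sketch of that Downloads shape, assuming tokio tasks with a per-download oneshot cancel handle; the names (`Downloads`, `queue`, `cancel_all`, `download_block`) are illustrative, not Zebra's actual API:

```rust
// Each in-flight download gets a oneshot cancel handle, and the struct owns
// them so pending work can be cancelled as a group.
use std::collections::HashMap;
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

#[derive(Default)]
struct Downloads {
    // block hash (just a String here) -> cancel handle for its task
    pending: HashMap<String, oneshot::Sender<()>>,
}

impl Downloads {
    /// Spawn a download task that stops early if its cancel handle fires
    /// (or is dropped), even after the download has started.
    fn queue(&mut self, hash: String) -> JoinHandle<()> {
        let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
        self.pending.insert(hash.clone(), cancel_tx);
        tokio::spawn(async move {
            tokio::select! {
                // Cancelled (or the handle was dropped): give up quietly.
                _ = cancel_rx => {}
                // Download finished; verification etc. would go here.
                _ = download_block(&hash) => {}
            }
        })
    }

    /// Cancel every in-flight download, e.g. when the sync pipeline is flushed.
    fn cancel_all(&mut self) {
        for (_hash, cancel_tx) in self.pending.drain() {
            let _ = cancel_tx.send(());
        }
    }
}

// Stand-in for the real peer request.
async fn download_block(hash: &str) {
    tokio::time::sleep(Duration::from_millis(10)).await;
    let _ = hash;
}

#[tokio::main]
async fn main() {
    let mut downloads = Downloads::default();
    let handle = downloads.queue("hash-a".to_string());
    downloads.cancel_all();
    // The task ends via its cancel branch rather than finishing the download.
    let _ = handle.await;
}
```
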
Use the better cancellation logic to return to the previous sync
algorithm.  As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and to handle errors by flushing the
pipeline and starting over.  This hasn't worked well before, because we
didn't cancel tasks properly.  Now that we can, try something in the
spirit of the original sync algorithm.
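
A rough sketch of that flush-and-restart control flow, with the same caveat that `run_sync_attempt`, `SyncError`, and the 5-second restart delay are placeholders, not Zebra's actual names or tuning:

```rust
// Download prospectively; on any error, flush the pipeline and start over.
use std::time::Duration;

#[derive(Debug)]
struct SyncError;

// Stand-in for one prospective sync run: obtain tips, queue downloads
// optimistically, and await verification of the results.
async fn run_sync_attempt() -> Result<(), SyncError> {
    Ok(())
}

async fn sync_forever() {
    loop {
        match run_sync_attempt().await {
            Ok(()) => {
                // Finished a run; wait a little before looking for new tips.
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
            Err(e) => {
                // Any failure flushes the pipeline (cancelling in-flight
                // downloads, as in the Downloads sketch above) and restarts.
                eprintln!("sync attempt failed: {e:?}; restarting");
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Let the loop run briefly; a real node would run it for its lifetime.
    let sync = tokio::spawn(sync_forever());
    tokio::time::sleep(Duration::from_millis(50)).await;
    sync.abort();
}
```
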
@hdevalence
Contributor Author

Normal operation, showing spikes when we start downloading new blocks, followed by dead periods where we're waiting for failed or high-latency requests to time out before we can keep going:

Screenshot from 2020-09-22 13-17-24

Here you can see that the peer set is being fully utilized in the peak periods and goes unused in the dead periods.

With these changes, after the hedge middleware kicks in, downloads slow to a crawl:

Screenshot from 2020-09-22 13-17-49

Here you can see that the peer set is not being well-utilized.

@hdevalence
Contributor Author

Using hdevalence/tower@486360e fixes the performance regression and seems to do better overall. These charts show higher sustained download speeds and less time spent waiting for a single block to arrive, although the problem is not completely gone:

Screenshot from 2020-09-22 14-41-16

Screenshot from 2020-09-22 14-40-46

@dconnolly
Contributor

@hdevalence are we going to pursue hedge, or not?

@teor2345
Contributor

Could this PR fix the sync hang in #1181?

@teor2345 teor2345 mentioned this pull request Oct 19, 2020
@hdevalence
Contributor Author

This was blocked on tower-rs/tower#472 referenced above but can go ahead now. I'd prefer to combine this PR with #1041, because that way there's just one set of sync changes to keep track of and review.

@hdevalence hdevalence marked this pull request as ready for review October 21, 2020 19:58
@teor2345
Contributor

Sounds good.

Did you want a review, even though the tests are failing?
(I think the failure is spurious - the retry timeout test is obsoleted by the download set changes.)


@teor2345 teor2345 closed this Oct 21, 2020
@teor2345
Contributor

Oops, sorry, we're missing the final commit here in #1041.

@teor2345 teor2345 reopened this Oct 21, 2020
@hdevalence
Copy link
Contributor Author

I'd rather just close this and wait until I've finished rebasing #1041.

@hdevalence hdevalence closed this Oct 21, 2020
/// their connection state.
///
/// (ObtainTips failures use the sync restart timeout.)
const TIPS_RETRY_TIMEOUT: Duration = Duration::from_secs(60);
/// Controls how long we wait to restart syncing after finishing a sync run.
///
/// This timeout should be long enough to:
Contributor


This comment needs updating, but I don't know enough about the details of #1041 and #1089 to make an accurate suggestion.

Contributor

@teor2345 teor2345 left a comment


This looks good - there are some comments that need updating.

I haven't tested it yet.

@hdevalence hdevalence mentioned this pull request Oct 22, 2020
@teor2345 teor2345 deleted the hedge branch March 21, 2022 21:25