
WIP: use hedge middleware #1089

Closed
wants to merge 6 commits

Conversation

hdevalence
Contributor

The hedge middleware implements hedged requests, as described in The Tail At Scale. The idea is that we auto-tune our retry logic to actual network conditions, pre-emptively retrying requests that exceed some latency percentile. This would hopefully solve the problem where our timeouts are too long for mainnet and too short for testnet.
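
As a rough illustration of the hedging idea only (this is not tower's hedge API and not Zebra code; the peer names, the stand-in fetch function, and the hardcoded p95 delay are all invented):

```rust
// A minimal sketch of hedged requests: if the first copy of a request is
// still pending after a delay derived from an observed latency percentile,
// send a second copy and take whichever response arrives first.
use std::time::Duration;
use tokio::time::sleep;

// Stand-in for a real peer request.
async fn fetch(peer: &str) -> Result<String, String> {
    sleep(Duration::from_millis(50)).await;
    Ok(format!("response from {peer}"))
}

async fn hedged_fetch(primary: &str, backup: &str, p95: Duration) -> Result<String, String> {
    tokio::select! {
        // The original request.
        first = fetch(primary) => first,
        // The hedge: only sent if the original is still pending after `p95`.
        second = async {
            sleep(p95).await;
            fetch(backup).await
        } => second,
    }
}

#[tokio::main]
async fn main() {
    // In the real middleware the percentile would be measured from recent
    // requests; here it is hardcoded for illustration.
    let resp = hedged_fetch("peer-a", "peer-b", Duration::from_millis(30)).await;
    println!("{resp:?}");
}
```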

Currently this doesn't work well: for some reason, as soon as the hedge middleware kicks in, something blocks requests from actually reaching the peer set. This shows up as long delays pushing requests through the middleware stack, even though the peer set has many ready peer connections.

I'm not totally sure why this happens. At the moment digging into it seems like a low priority, so this PR can sit open for a while.

We handle request cancellation in two places: before we transition into
the AwaitingResponse state, and while we are in AwaitingResponse.  We
need both, because if we only checked before starting, a request that
was already being processed wouldn't see its cancellation until the
timeout elapsed.

The first is a check that the oneshot is not already canceled.

For the second, we wait on a cancellation, either from a timeout or from
the tx channel closing.
Incoming messages might be unsolicited, or they might be a response to a
request we already canceled.  In either case, don't fail the whole
connection; just drop the message and move on.
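
A minimal sketch of those two cancellation points, assuming a tokio oneshot as the cancellation signal; `PeerRequest`, `send_to_peer`, and the 4-second timeout are placeholders, not Zebra's actual types or values:

```rust
// Sketch: check for cancellation before entering AwaitingResponse, and race
// the response against cancellation while in AwaitingResponse.
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

struct PeerRequest {
    body: String,
    // The requester cancels by dropping the receiving end; we hold the sender.
    tx: oneshot::Sender<Result<String, String>>,
}

// Stand-in for the real network round trip.
async fn send_to_peer(body: &str) -> Result<String, String> {
    Ok(format!("pong for {body}"))
}

async fn handle_request(mut req: PeerRequest) {
    // Cancellation point 1: before transitioning into AwaitingResponse,
    // check whether the requester has already gone away.
    if req.tx.is_closed() {
        return;
    }

    // Cancellation point 2: while in AwaitingResponse, race the response
    // against both a timeout and the requester hanging up.
    let body = req.body.clone();
    let response = tokio::select! {
        resp = timeout(Duration::from_secs(4), send_to_peer(&body)) => {
            resp.map_err(|_| "timeout".to_string()).and_then(|r| r)
        }
        _ = req.tx.closed() => return, // canceled mid-flight: just stop
    };

    // Ignore the send error: the requester may have hung up in the meantime.
    let _ = req.tx.send(response);
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();
    handle_request(PeerRequest { body: "ping".into(), tx }).await;
    println!("{:?}", rx.await);
}
```
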
This makes two changes relative to the existing download code:

1.  It uses a oneshot to attempt to cancel the download task after it
    has started;

2.  It encapsulates the download creation and cancellation logic into a
    Downloads struct (see the sketch after this list).
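
A rough sketch of that Downloads shape, assuming tokio tasks with a per-download oneshot cancel handle; the names (`Downloads`, `queue`, `cancel_all`, `download_block`) are illustrative, not Zebra's actual API:

```rust
// Each in-flight download gets a oneshot cancel handle, and the struct owns
// them so pending work can be cancelled as a group.
use std::collections::HashMap;
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

#[derive(Default)]
struct Downloads {
    // block hash (just a String here) -> cancel handle for its task
    pending: HashMap<String, oneshot::Sender<()>>,
}

impl Downloads {
    /// Spawn a download task that stops early if its cancel handle fires
    /// (or is dropped), even after the download has started.
    fn queue(&mut self, hash: String) -> JoinHandle<()> {
        let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
        self.pending.insert(hash.clone(), cancel_tx);
        tokio::spawn(async move {
            tokio::select! {
                // Cancelled (or the handle was dropped): give up quietly.
                _ = cancel_rx => {}
                // Download finished; verification etc. would go here.
                _ = download_block(&hash) => {}
            }
        })
    }

    /// Cancel every in-flight download, e.g. when the sync pipeline is flushed.
    fn cancel_all(&mut self) {
        for (_hash, cancel_tx) in self.pending.drain() {
            let _ = cancel_tx.send(());
        }
    }
}

// Stand-in for the real peer request.
async fn download_block(hash: &str) {
    tokio::time::sleep(Duration::from_millis(10)).await;
    let _ = hash;
}

#[tokio::main]
async fn main() {
    let mut downloads = Downloads::default();
    let handle = downloads.queue("hash-a".to_string());
    downloads.cancel_all();
    // The task ends via its cancel branch rather than finishing the download.
    let _ = handle.await;
}
```
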
Use the better cancellation logic to return to the previous sync
algorithm.  As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and to handle errors by flushing the
pipeline and starting over.  This hasn't worked well before, because we
didn't cancel tasks properly.  Now that we can, try something in the
spirit of the original sync algorithm.
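
A rough sketch of that flush-and-restart control flow, with the same caveat that `run_sync_attempt`, `SyncError`, and the 5-second restart delay are placeholders, not Zebra's actual names or tuning:

```rust
// Download prospectively; on any error, flush the pipeline and start over.
use std::time::Duration;

#[derive(Debug)]
struct SyncError;

// Stand-in for one prospective sync run: obtain tips, queue downloads
// optimistically, and await verification of the results.
async fn run_sync_attempt() -> Result<(), SyncError> {
    Ok(())
}

async fn sync_forever() {
    loop {
        match run_sync_attempt().await {
            Ok(()) => {
                // Finished a run; wait a little before looking for new tips.
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
            Err(e) => {
                // Any failure flushes the pipeline (cancelling in-flight
                // downloads, as in the Downloads sketch above) and restarts.
                eprintln!("sync attempt failed: {e:?}; restarting");
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Let the loop run briefly; a real node would run it for its lifetime.
    let sync = tokio::spawn(sync_forever());
    tokio::time::sleep(Duration::from_millis(50)).await;
    sync.abort();
}
```
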
@hdevalence
Contributor Author

Normal operation, showing spikes when we start downloading new blocks, followed by dead periods where we're waiting for failed or high-latency requests to time out before we can keep going:

Screenshot from 2020-09-22 13-17-24

Here you can see that the peer set is being fully utilized in the peak periods and goes unused in the dead periods.

With these changes, after the hedge middleware kicks in, downloads slow to a crawl:

Screenshot from 2020-09-22 13-17-49

Here you can see that the peer set is not being well-utilized.

@hdevalence
Contributor Author

Using hdevalence/tower@486360e fixes the performance regression and seems to do better overall. These charts show higher sustained download speeds and less time spent waiting for a single block to arrive, although the problem is not completely gone:

Screenshot from 2020-09-22 14-41-16

Screenshot from 2020-09-22 14-40-46

@dconnolly
Contributor

@hdevalence are we going to pursue hedge, or not?

@teor2345
Contributor

Could this PR fix the sync hang in #1181?

@teor2345 teor2345 mentioned this pull request Oct 19, 2020
@hdevalence
Contributor Author

This was blocked on tower-rs/tower#472 referenced above but can go ahead now. I'd prefer to combine this PR with #1041, because that way there's just one set of sync changes to keep track of and review.

@hdevalence hdevalence marked this pull request as ready for review October 21, 2020 19:58
@teor2345
Contributor

Sounds good.

Did you want a review, even though the tests are failing?
(I think the failure is spurious - the retry timeout test is obsoleted by the download set changes.)


@teor2345 teor2345 closed this Oct 21, 2020
@teor2345
Contributor

Oops, sorry, we're missing the final commit here in #1041.

@teor2345 teor2345 reopened this Oct 21, 2020
@hdevalence
Copy link
Contributor Author

I'd rather just close this and wait until I've finished rebasing #1041.

@hdevalence hdevalence closed this Oct 21, 2020
/// their connection state.
///
/// (ObtainTips failures use the sync restart timeout.)
const TIPS_RETRY_TIMEOUT: Duration = Duration::from_secs(60);
/// Controls how long we wait to restart syncing after finishing a sync run.
///
/// This timeout should be long enough to:
Contributor


This comment needs updating, but I don't know enough about the details of #1041 and #1089 to make an accurate suggestion.

Contributor

@teor2345 teor2345 left a comment


This looks good - there are some comments that need updating.

I haven't tested it yet.

@hdevalence hdevalence mentioned this pull request Oct 22, 2020
@teor2345 teor2345 deleted the hedge branch March 21, 2022 21:25